http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_insert.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_insert.html b/docs/build/html/topics/impala_insert.html new file mode 100644 index 0000000..557ab70 --- /dev/null +++ b/docs/build/html/topics/impala_insert.html @@ -0,0 +1,798 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref_sql.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="insert"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>INSERT Statement</title></head><body id="insert"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">INSERT Statement</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Impala supports inserting into tables and partitions that you create with the Impala <code class="ph codeph">CREATE + TABLE</code> statement, or pre-defined tables and partitions created through Hive. + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + +<pre class="pre codeblock"><code>[<var class="keyword varname">with_clause</var>] +INSERT { INTO | OVERWRITE } [TABLE] <var class="keyword varname">table_name</var> + [(<var class="keyword varname">column_list</var>)] + [ PARTITION (<var class="keyword varname">partition_clause</var>)] +{ + [<var class="keyword varname">hint_clause</var>] <var class="keyword varname">select_statement</var> + | VALUES (<var class="keyword varname">value</var> [, <var class="keyword varname">value</var> ...]) [, (<var class="keyword varname">value</var> [, <var class="keyword varname">value</var> ...]) ...] +} + +partition_clause ::= <var class="keyword varname">col_name</var> [= <var class="keyword varname">constant</var>] [, <var class="keyword varname">col_name</var> [= <var class="keyword varname">constant</var>] ...] + +hint_clause ::= [SHUFFLE] | [NOSHUFFLE] (Note: the square brackets are part of the syntax.) +</code></pre> + + <p class="p"> + <strong class="ph b">Appending or replacing (INTO and OVERWRITE clauses):</strong> + </p> + + <p class="p"> + The <code class="ph codeph">INSERT INTO</code> syntax appends data to a table. The existing data files are left as-is, and + the inserted data is put into one or more new data files. + </p> + + <p class="p"> + The <code class="ph codeph">INSERT OVERWRITE</code> syntax replaces the data in a table. + + + Currently, the overwritten data files are deleted immediately; they do not go through the HDFS trash + mechanism. 
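+ </p>
+
+ <p class="p">
+ As a minimal sketch of the two forms (the table names here are illustrative, not part of the
+ examples later in this topic):
+ </p>
+
+<pre class="pre codeblock"><code>-- Appends new data files alongside any existing ones.
+INSERT INTO t1 SELECT * FROM staging_table;
+
+-- Replaces the entire contents of t1; the prior data files are deleted immediately.
+INSERT OVERWRITE t1 SELECT * FROM staging_table;
+</code></pre>
+
+ <p class="p">
+ Because the overwritten files bypass the HDFS trash mechanism, double-check the target table
+ before running an <code class="ph codeph">INSERT OVERWRITE</code> statement.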
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Complex type considerations:</strong>
+ </p>
+
+ <p class="p">
+ The <code class="ph codeph">INSERT</code> statement currently does not support writing data files
+ containing complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>).
+ To prepare Parquet data for such tables, you generate the data files outside Impala and then
+ use <code class="ph codeph">LOAD DATA</code> or <code class="ph codeph">CREATE EXTERNAL TABLE</code> to associate those
+ data files with the table. Currently, such tables must use the Parquet file format.
+ See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details about working with complex types.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Kudu considerations:</strong>
+ </p>
+
+ <p class="p">
+ Currently, the <code class="ph codeph">INSERT OVERWRITE</code> syntax cannot be used with Kudu tables.
+ </p>
+
+ <p class="p">
+ Kudu tables require a unique primary key for each row. If an <code class="ph codeph">INSERT</code>
+ statement attempts to insert a row with the same values for the primary key columns
+ as an existing row, that row is discarded and the insert operation continues.
+ When rows are discarded due to duplicate primary keys, the statement finishes
+ with a warning, not an error. (This is a change from early releases of Kudu
+ where the default was to return an error in such cases, and the syntax
+ <code class="ph codeph">INSERT IGNORE</code> was required to make the statement succeed.
+ The <code class="ph codeph">IGNORE</code> clause is no longer part of the <code class="ph codeph">INSERT</code>
+ syntax.)
+ </p>
+
+ <p class="p">
+ For situations where you prefer to replace rows with duplicate primary key values,
+ rather than discarding the new data, you can use the <code class="ph codeph">UPSERT</code>
+ statement instead of <code class="ph codeph">INSERT</code>. <code class="ph codeph">UPSERT</code> inserts
+ rows that are entirely new, and for rows that match an existing primary key in the
+ table, the non-primary-key columns are updated to reflect the values in the
+ <span class="q">"upserted"</span> data.
+ </p>
+
+ <p class="p">
+ If you really want to store new rows, not replace existing ones, but cannot do so
+ because of the primary key uniqueness constraint, consider recreating the table
+ with additional columns included in the primary key.
+ </p>
+
+ <p class="p">
+ See <a class="xref" href="impala_kudu.html#impala_kudu">Using Impala to Query Kudu Tables</a> for more details about using Impala with Kudu.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <p class="p">
+ Impala currently supports:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Copying data from another table using a <code class="ph codeph">SELECT</code> query. In Impala 1.2.1 and higher, you can
+ combine <code class="ph codeph">CREATE TABLE</code> and <code class="ph codeph">INSERT</code> operations into a single step with the
+ <code class="ph codeph">CREATE TABLE AS SELECT</code> syntax, which bypasses the actual <code class="ph codeph">INSERT</code> keyword.
+ </li>
+
+ <li class="li">
+ An optional <a class="xref" href="impala_with.html#with"><code class="ph codeph">WITH</code> clause</a> before the
+ <code class="ph codeph">INSERT</code> keyword, to define a subquery referenced in the <code class="ph codeph">SELECT</code> portion.
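+ As a minimal sketch (the table and column names here are hypothetical):
+<pre class="pre codeblock"><code>WITH recent AS (SELECT id, amount FROM sales WHERE year = 2017)
+INSERT INTO sales_summary SELECT id, SUM(amount) FROM recent GROUP BY id;
+</code></pre>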
+ </li>
+
+ <li class="li">
+ Creating one or more new rows using constant expressions through the <code class="ph codeph">VALUES</code> clause. (The
+ <code class="ph codeph">VALUES</code> clause was added in Impala 1.0.1.)
+ </li>
+
+ <li class="li">
+ <p class="p">
+ By default, the first column of each newly inserted row goes into the first column of the table, the
+ second column into the second column, and so on.
+ </p>
+ <p class="p">
+ You can also specify the columns to be inserted, an arbitrarily ordered subset of the columns in the
+ destination table, by specifying a column list immediately after the name of the destination table. This
+ feature lets you adjust the inserted columns to match the layout of a <code class="ph codeph">SELECT</code> statement,
+ rather than the other way around. (This feature was added in Impala 1.1.)
+ </p>
+ <p class="p">
+ The number of columns mentioned in the column list (known as the <span class="q">"column permutation"</span>) must match
+ the number of columns in the <code class="ph codeph">SELECT</code> list or the <code class="ph codeph">VALUES</code> tuples. The
+ order of columns in the column permutation can be different than in the underlying table, and the columns
+ of each input row are reordered to match. If the number of columns in the column permutation is less than
+ in the destination table, all unmentioned columns are set to <code class="ph codeph">NULL</code>.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ For a partitioned table, the optional <code class="ph codeph">PARTITION</code> clause identifies which partition or
+ partitions the new values go into. If a partition key column is given a constant value such as
+ <code class="ph codeph">PARTITION (year=2012)</code> or <code class="ph codeph">PARTITION (year=2012, month=2)</code>, all the
+ inserted rows use those same values for those partition key columns and you omit any corresponding
+ columns in the source table from the <code class="ph codeph">SELECT</code> list. This form is known as <span class="q">"static
+ partitioning"</span>.
+ </p>
+ <p class="p">
+ If a partition key column is mentioned but not assigned a value, such as in <code class="ph codeph">PARTITION (year,
+ region)</code> (both columns unassigned) or <code class="ph codeph">PARTITION(year, region='CA')</code>
+ (<code class="ph codeph">year</code> column unassigned), the unassigned columns are filled in with the final columns of
+ the <code class="ph codeph">SELECT</code> list. In this case, the number of columns in the <code class="ph codeph">SELECT</code> list
+ must equal the number of columns in the column permutation plus the number of partition key columns not
+ assigned a constant value. This form is known as <span class="q">"dynamic partitioning"</span>.
+ </p>
+ <p class="p">
+ See <a class="xref" href="impala_partitioning.html#partition_static_dynamic">Static and Dynamic Partitioning Clauses</a> for examples and performance
+ characteristics of static and dynamic partitioned inserts.
+ </p>
+ </li>
+
+ <li class="li">
+ An optional hint clause immediately before the <code class="ph codeph">SELECT</code> keyword, to fine-tune the behavior
+ when doing an <code class="ph codeph">INSERT ... SELECT</code> operation into partitioned Parquet tables. The hint
+ keywords are <code class="ph codeph">[SHUFFLE]</code> and <code class="ph codeph">[NOSHUFFLE]</code>, including the square brackets.
+ Inserting into partitioned Parquet tables can be a resource-intensive operation because it potentially
+ involves many files being written to HDFS simultaneously, and separate
+ <span class="ph">large</span> memory buffers being allocated to buffer the data for each
+ partition. For usage details, see <a class="xref" href="impala_parquet.html#parquet_etl">Loading Data into Parquet Tables</a>.
+ </li>
+ </ul>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">INSERT</code> statements that create new partitions or add data files result in changes to Hive
+ metadata. Because Impala uses Hive metadata, such changes may necessitate a metadata refresh. For more
+ information, see the <a class="xref" href="impala_refresh.html#refresh">REFRESH</a> statement.
+ </li>
+
+ <li class="li">
+ Currently, Impala can only insert data into tables that use the text and Parquet formats. For other file
+ formats, insert the data using Hive and use Impala to query it.
+ </li>
+
+ <li class="li">
+ As an alternative to the <code class="ph codeph">INSERT</code> statement, if you have existing data files elsewhere in
+ HDFS, the <code class="ph codeph">LOAD DATA</code> statement can move those files into a table. This statement works
+ with tables of any file format.
+ </li>
+ </ul>
+ </div>
+
+ <p class="p">
+ <strong class="ph b">Statement type:</strong> DML (but still affected by
+ <a class="xref" href="../shared/../topics/impala_sync_ddl.html#sync_ddl">SYNC_DDL</a> query option)
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <p class="p">
+ When you insert the results of an expression, particularly of a built-in function call, into a small numeric
+ column such as <code class="ph codeph">INT</code>, <code class="ph codeph">SMALLINT</code>, <code class="ph codeph">TINYINT</code>, or
+ <code class="ph codeph">FLOAT</code>, you might need to use a <code class="ph codeph">CAST()</code> expression to coerce values into the
+ appropriate type. Impala does not automatically convert from a larger type to a smaller one. For example, to
+ insert cosine values into a <code class="ph codeph">FLOAT</code> column, write <code class="ph codeph">CAST(COS(angle) AS FLOAT)</code>
+ in the <code class="ph codeph">INSERT</code> statement to make the conversion explicit.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">File format considerations:</strong>
+ </p>
+
+ <p class="p">
+ Because Impala can read certain file formats that it cannot write,
+ the <code class="ph codeph">INSERT</code> statement does not work for all kinds of
+ Impala tables. See <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a>
+ for details about what file formats are supported by the
+ <code class="ph codeph">INSERT</code> statement.
+ </p>
+
+ <p class="p">
+ Any <code class="ph codeph">INSERT</code> statement for a Parquet table requires enough free space in the HDFS filesystem
+ to write one block. Because Parquet data files use a block size of 1 GB by default, an
+ <code class="ph codeph">INSERT</code> might fail (even for a very small amount of data) if your HDFS is running low on
+ space.
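+ </p>
+
+ <p class="p">
+ If the default block size is the obstacle, one possible workaround (shown here as a sketch, using the
+ tables created in the examples below) is to lower the <code class="ph codeph">PARQUET_FILE_SIZE</code> query option
+ before the insert:
+ </p>
+
+<pre class="pre codeblock"><code>SET PARQUET_FILE_SIZE=256m;
+INSERT INTO parquet_table SELECT * FROM text_table;
+</code></pre>
+
+ <p class="p">
+ Smaller blocks reduce the free-space requirement for the insert, at some potential cost in scan
+ efficiency for later queries.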
+ </p> + + <p class="p"> + If you connect to different Impala nodes within an <span class="keyword cmdname">impala-shell</span> session for + load-balancing purposes, you can enable the <code class="ph codeph">SYNC_DDL</code> query option to make each DDL + statement wait before returning, until the new or changed metadata has been received by all the Impala + nodes. See <a class="xref" href="../shared/../topics/impala_sync_ddl.html#sync_ddl">SYNC_DDL Query Option</a> for details. + </p> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + After adding or replacing data in a table used in performance-critical queries, issue a <code class="ph codeph">COMPUTE + STATS</code> statement to make sure all statistics are up-to-date. Consider updating statistics for a + table after any <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, or <code class="ph codeph">CREATE TABLE AS + SELECT</code> statement in Impala, or after loading data through Hive and doing a <code class="ph codeph">REFRESH + <var class="keyword varname">table_name</var></code> in Impala. This technique is especially important for tables that + are very large, used in join queries, or both. + </div> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <p class="p"> + The following example sets up new tables with the same definition as the <code class="ph codeph">TAB1</code> table from the + <a class="xref" href="impala_tutorial.html#tutorial">Tutorial</a> section, using different file + formats, and demonstrates inserting data into the tables created with the <code class="ph codeph">STORED AS TEXTFILE</code> + and <code class="ph codeph">STORED AS PARQUET</code> clauses: + </p> + +<pre class="pre codeblock"><code>CREATE DATABASE IF NOT EXISTS file_formats; +USE file_formats; + +DROP TABLE IF EXISTS text_table; +CREATE TABLE text_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS TEXTFILE; + +DROP TABLE IF EXISTS parquet_table; +CREATE TABLE parquet_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS PARQUET;</code></pre> + + <p class="p"> + With the <code class="ph codeph">INSERT INTO TABLE</code> syntax, each new set of inserted rows is appended to any existing + data in the table. This is how you would record small amounts of data that arrive continuously, or ingest new + batches of data alongside the existing data. For example, after running 2 <code class="ph codeph">INSERT INTO TABLE</code> + statements with 5 rows each, the table contains 10 rows total: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > insert into table text_table select * from default.tab1; +Inserted 5 rows in 0.41s + +[localhost:21000] > insert into table text_table select * from default.tab1; +Inserted 5 rows in 0.46s + +[localhost:21000] > select count(*) from text_table; ++----------+ +| count(*) | ++----------+ +| 10 | ++----------+ +Returned 1 row(s) in 0.26s</code></pre> + + <p class="p"> + With the <code class="ph codeph">INSERT OVERWRITE TABLE</code> syntax, each new set of inserted rows replaces any existing + data in the table. This is how you load data to query in a data warehousing scenario where you analyze just + the data for a particular day, quarter, and so on, discarding the previous data each time. You might keep the + entire set of data in one raw table, and transfer and transform certain rows into a more compact and + efficient form to perform intensive analysis on that subset. 
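+ </p>
+
+ <p class="p">
+ A sketch of that raw-to-compact pattern (the table and column names are hypothetical):
+ </p>
+
+<pre class="pre codeblock"><code>-- Re-derive the compact analysis table from the raw data, discarding the previous contents.
+INSERT OVERWRITE compact_table
+  SELECT customer_id, SUM(amount) FROM raw_table WHERE quarter = 'Q1' GROUP BY customer_id;
+</code></pre>
+
+ <p class="p">
+ The raw table remains the only copy of the full history; each overwrite keeps just the subset
+ currently being analyzed.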
+ </p> + + <p class="p"> + For example, here we insert 5 rows into a table using the <code class="ph codeph">INSERT INTO</code> clause, then replace + the data by inserting 3 rows with the <code class="ph codeph">INSERT OVERWRITE</code> clause. Afterward, the table only + contains the 3 rows from the final <code class="ph codeph">INSERT</code> statement. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > insert into table parquet_table select * from default.tab1; +Inserted 5 rows in 0.35s + +[localhost:21000] > insert overwrite table parquet_table select * from default.tab1 limit 3; +Inserted 3 rows in 0.43s +[localhost:21000] > select count(*) from parquet_table; ++----------+ +| count(*) | ++----------+ +| 3 | ++----------+ +Returned 1 row(s) in 0.43s</code></pre> + + <p class="p"> + The <code class="ph codeph"><a class="xref" href="impala_insert.html#values">VALUES</a></code> clause lets you insert one or more + rows by specifying constant values for all the columns. The number, types, and order of the expressions must + match the table definition. + </p> + + <div class="note note note_note" id="insert__insert_values_warning"><span class="note__title notetitle">Note:</span> + The <code class="ph codeph">INSERT ... VALUES</code> technique is not suitable for loading large quantities of data into + HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate + data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL + syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not + run scripts with thousands of <code class="ph codeph">INSERT ... VALUES</code> statements that insert a single row each + time. If you do run <code class="ph codeph">INSERT ... VALUES</code> operations to load data into a staging table as one + stage in an ETL pipeline, include multiple row values if possible within each <code class="ph codeph">VALUES</code> clause, + and use a separate database to make cleanup easier if the operation does produce many tiny files. + </div> + + <p class="p"> + The following example shows how to insert one row or multiple rows, with expressions of different types, + using literal values, expressions, and function return values: + </p> + +<pre class="pre codeblock"><code>create table val_test_1 (c1 int, c2 float, c3 string, c4 boolean, c5 timestamp); +insert into val_test_1 values (100, 99.9/10, 'abc', true, now()); +create table val_test_2 (id int, token string); +insert overwrite val_test_2 values (1, 'a'), (2, 'b'), (-1,'xyzzy');</code></pre> + + <p class="p"> + These examples show the type of <span class="q">"not implemented"</span> error that you see when attempting to insert data into + a table with a file format that Impala currently does not write to: + </p> + +<pre class="pre codeblock"><code>DROP TABLE IF EXISTS sequence_table; +CREATE TABLE sequence_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS SEQUENCEFILE; + +DROP TABLE IF EXISTS rc_table; +CREATE TABLE rc_table +( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP ) +STORED AS RCFILE; + +[localhost:21000] > insert into table rc_table select * from default.tab1; +Remote error +Backend 0:RC_FILE not implemented. + +[localhost:21000] > insert into table sequence_table select * from default.tab1; +Remote error +Backend 0:SEQUENCE_FILE not implemented. 
</code></pre> + + <p class="p"> + Inserting data into partitioned tables requires slightly different syntax that divides the partitioning + columns from the others: + </p> + +<pre class="pre codeblock"><code>create table t1 (i int) <strong class="ph b">partitioned by (x int, y string)</strong>; +-- Select an INT column from another table. +-- All inserted rows will have the same x and y values, as specified in the INSERT statement. +-- This technique of specifying all the partition key values is known as static partitioning. +insert into t1 <strong class="ph b">partition(x=10, y='a')</strong> select c1 from some_other_table; +-- Select two INT columns from another table. +-- All inserted rows will have the same y value, as specified in the INSERT statement. +-- Values from c2 go into t1.x. +-- Any partitioning columns whose value is not specified are filled in +-- from the columns specified last in the SELECT list. +-- This technique of omitting some partition key values is known as dynamic partitioning. +insert into t1 <strong class="ph b">partition(x, y='b')</strong> select c1, c2 from some_other_table; +-- Select an INT and a STRING column from another table. +-- All inserted rows will have the same x value, as specified in the INSERT statement. +-- Values from c3 go into t1.y. +insert into t1 <strong class="ph b">partition(x=20, y)</strong> select c1, c3 from some_other_table;</code></pre> + + <p class="p"> + The following examples show how you can copy the data in all the columns from one table to another, copy the + data from only some columns, or specify the columns in the select list in a different order than they + actually appear in the table: + </p> + +<pre class="pre codeblock"><code>-- Start with 2 identical tables. +create table t1 (c1 int, c2 int); +create table t2 like t1; + +-- If there is no () part after the destination table name, +-- all columns must be specified, either as * or by name. +insert into t2 select * from t1; +insert into t2 select c1, c2 from t1; + +-- With the () notation following the destination table name, +-- you can omit columns (all values for that column are NULL +-- in the destination table), and/or reorder the values +-- selected from the source table. This is the "column permutation" feature. +insert into t2 (c1) select c1 from t1; +insert into t2 (c2, c1) select c1, c2 from t1; + +-- The column names can be entirely different in the source and destination tables. +-- You can copy any columns, not just the corresponding ones, from the source table. +-- But the number and type of selected columns must match the columns mentioned in the () part. +alter table t2 replace columns (x int, y int); +insert into t2 (y) select c1 from t1; + +-- For partitioned tables, all the partitioning columns must be mentioned in the () column list +-- or a PARTITION clause; these columns cannot be defaulted to NULL. +create table pt1 (x int, y int) partitioned by (z int); +-- The values from c1 are copied into the column x in the new table, +-- all in the same partition based on a constant value for z. +-- The values of y in the new table are all NULL. +insert into pt1 (x) partition (z=5) select c1 from t1; +-- Again we omit the values for column y so they are all NULL. +-- The inserted x values can go into different partitions, based on +-- the different values inserted into the partitioning column z. 
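+-- Dynamic partitioning without a PARTITION clause: here the partition column z
+-- is named in the () column list and takes its values from the SELECT list.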
+insert into pt1 (x,z) select x, z from t2;
+</code></pre>

 <p class="p">
 <code class="ph codeph">SELECT *</code> for a partitioned table requires that all partition key columns in the source table
 be declared as the last columns in the <code class="ph codeph">CREATE TABLE</code> statement. You still include a
 <code class="ph codeph">PARTITION</code> clause listing all the partition key columns. These partition columns are
 automatically mapped to the last columns from the <code class="ph codeph">SELECT *</code> list.
 </p>

<pre class="pre codeblock"><code>create table source (x int, y int, year int, month int, day int);
create table destination (x int, y int) partitioned by (year int, month int, day int);
...load some data into the unpartitioned source table...
-- Insert a single partition of data.
-- The SELECT * means you cannot specify partition (year=2014, month, day).
insert overwrite destination partition (year, month, day) select * from source where year=2014;
-- Insert the data for all year/month/day combinations.
insert overwrite destination partition (year, month, day) select * from source;

-- If one of the partition columns is omitted from the source table,
-- then you can specify a specific value for that column in the PARTITION clause.
-- Here the source table holds only data from 2014, and so does not include a year column.
create table source_2014 (x int, y int, month int, day int);
...load some data into the unpartitioned source_2014 table...
insert overwrite destination partition (year=2014, month, day) select * from source_2014;
</code></pre>

 <p class="p">
 <strong class="ph b">Sorting considerations:</strong> Although you can specify an <code class="ph codeph">ORDER BY</code> clause in an
 <code class="ph codeph">INSERT ... SELECT</code> statement, any <code class="ph codeph">ORDER BY</code> clause is ignored and the
 results are not necessarily sorted. An <code class="ph codeph">INSERT ... SELECT</code> operation potentially creates
 many different data files, prepared on different data nodes, and therefore the notion of the data being
 stored in sorted order is impractical.
 </p>

 <p class="p">
 <strong class="ph b">Concurrency considerations:</strong> Each <code class="ph codeph">INSERT</code> operation creates new data files with unique
 names, so you can run multiple <code class="ph codeph">INSERT INTO</code> statements simultaneously without filename
 conflicts.

 While data is being inserted into an Impala table, the data is staged temporarily in a subdirectory inside
 the data directory; during this period, you cannot issue queries against that table in Hive. If an
 <code class="ph codeph">INSERT</code> operation fails, the temporary data file and the subdirectory could be left behind in
 the data directory. If so, remove the relevant subdirectory and any data files it contains manually, by
 issuing an <code class="ph codeph">hdfs dfs -rm -r</code> command, specifying the full path of the work subdirectory, whose
 name ends in <code class="ph codeph">_dir</code>.
+ </p> + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref_sql.html">Impala SQL Statements</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="insert__values"> + + <h2 class="title topictitle2" id="ariaid-title2">VALUES Clause</h2> + + <div class="body conbody"> + + <p class="p"> + The <code class="ph codeph">VALUES</code> clause is a general-purpose way to specify the columns of one or more rows, + typically within an <code class="ph codeph"><a class="xref" href="impala_insert.html#insert">INSERT</a></code> statement. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + The <code class="ph codeph">INSERT ... VALUES</code> technique is not suitable for loading large quantities of data into + HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate + data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL + syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do + not run scripts with thousands of <code class="ph codeph">INSERT ... VALUES</code> statements that insert a single row + each time. If you do run <code class="ph codeph">INSERT ... VALUES</code> operations to load data into a staging table as + one stage in an ETL pipeline, include multiple row values if possible within each <code class="ph codeph">VALUES</code> + clause, and use a separate database to make cleanup easier if the operation does produce many tiny files. + </div> + + <p class="p"> + The following examples illustrate: + </p> + + <ul class="ul"> + <li class="li"> + How to insert a single row using a <code class="ph codeph">VALUES</code> clause. + </li> + + <li class="li"> + How to insert multiple rows using a <code class="ph codeph">VALUES</code> clause. + </li> + + <li class="li"> + How the row or rows from a <code class="ph codeph">VALUES</code> clause can be appended to a table through + <code class="ph codeph">INSERT INTO</code>, or replace the contents of the table through <code class="ph codeph">INSERT + OVERWRITE</code>. + </li> + + <li class="li"> + How the entries in a <code class="ph codeph">VALUES</code> clause can be literals, function results, or any other kind + of expression. See <a class="xref" href="impala_literals.html#literals">Literals</a> for the notation to use for literal + values, especially <a class="xref" href="impala_literals.html#string_literals">String Literals</a> for quoting and escaping + conventions for strings. See <a class="xref" href="impala_operators.html#operators">SQL Operators</a> and + <a class="xref" href="impala_functions.html#builtins">Impala Built-In Functions</a> for other things you can include in expressions with the + <code class="ph codeph">VALUES</code> clause. + </li> + </ul> + +<pre class="pre codeblock"><code>[localhost:21000] > describe val_example; +Query: describe val_example +Query finished, fetching results ... 
++-------+---------+---------+ +| name | type | comment | ++-------+---------+---------+ +| id | int | | +| col_1 | boolean | | +| col_2 | double | | ++-------+---------+---------+ + +[localhost:21000] > insert into val_example values (1,true,100.0); +Inserted 1 rows in 0.30s +[localhost:21000] > select * from val_example; ++----+-------+-------+ +| id | col_1 | col_2 | ++----+-------+-------+ +| 1 | true | 100 | ++----+-------+-------+ + +[localhost:21000] > insert overwrite val_example values (10,false,pow(2,5)), (50,true,10/3); +Inserted 2 rows in 0.16s +[localhost:21000] > select * from val_example; ++----+-------+-------------------+ +| id | col_1 | col_2 | ++----+-------+-------------------+ +| 10 | false | 32 | +| 50 | true | 3.333333333333333 | ++----+-------+-------------------+</code></pre> + + <p class="p"> + When used in an <code class="ph codeph">INSERT</code> statement, the Impala <code class="ph codeph">VALUES</code> clause can specify + some or all of the columns in the destination table, and the columns can be specified in a different order + than they actually appear in the table. To specify a different set or order of columns than in the table, + use the syntax: + </p> + +<pre class="pre codeblock"><code>INSERT INTO <var class="keyword varname">destination</var> + (<var class="keyword varname">col_x</var>, <var class="keyword varname">col_y</var>, <var class="keyword varname">col_z</var>) + VALUES + (<var class="keyword varname">val_x</var>, <var class="keyword varname">val_y</var>, <var class="keyword varname">val_z</var>); +</code></pre> + + <p class="p"> + Any columns in the table that are not listed in the <code class="ph codeph">INSERT</code> statement are set to + <code class="ph codeph">NULL</code>. + </p> + + + + <p class="p"> + To use a <code class="ph codeph">VALUES</code> clause like a table in other statements, wrap it in parentheses and use + <code class="ph codeph">AS</code> clauses to specify aliases for the entire object and any columns you need to refer to: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > select * from (values(4,5,6),(7,8,9)) as t; ++---+---+---+ +| 4 | 5 | 6 | ++---+---+---+ +| 4 | 5 | 6 | +| 7 | 8 | 9 | ++---+---+---+ +[localhost:21000] > select * from (values(1 as c1, true as c2, 'abc' as c3),(100,false,'xyz')) as t; ++-----+-------+-----+ +| c1 | c2 | c3 | ++-----+-------+-----+ +| 1 | true | abc | +| 100 | false | xyz | ++-----+-------+-----+</code></pre> + + <p class="p"> + For example, you might use a tiny table constructed like this from constant literals or function return + values as part of a longer statement involving joins or <code class="ph codeph">UNION ALL</code>. + </p> + + <p class="p"> + <strong class="ph b">HDFS considerations:</strong> + </p> + + <p class="p"> + Impala physically writes all inserted files under the ownership of its default user, typically + <code class="ph codeph">impala</code>. Therefore, this user must have HDFS write permission in the corresponding table + directory. + </p> + + <p class="p"> + The permission requirement is independent of the authorization performed by the Sentry framework. (If the + connected user is not authorized to insert into a table, Sentry blocks that operation immediately, + regardless of the privileges available to the <code class="ph codeph">impala</code> user.) Files created by Impala are + not owned by and do not inherit permissions from the connected user. 
+ </p>

 <p class="p">
 The number of data files produced by an <code class="ph codeph">INSERT</code> statement depends on the size of the
 cluster, the number of data blocks that are processed, the partition key columns in a partitioned table,
 and the mechanism Impala uses for dividing the work in parallel. Do not assume that an
 <code class="ph codeph">INSERT</code> statement will produce some particular number of output files. In case of
 performance issues with data written by Impala, check that the output files do not suffer from issues such
 as many tiny files or many tiny partitions. (In the Hadoop context, even files or partitions of a few tens
 of megabytes are considered <span class="q">"tiny"</span>.)
 </p>

 <p class="p">
 The <code class="ph codeph">INSERT</code> statement has always left behind a hidden work directory inside the data
 directory of the table. Formerly, this hidden work directory was named
 <span class="ph filepath">.impala_insert_staging</span>. In Impala 2.0.1 and later, this directory name is changed to
 <span class="ph filepath">_impala_insert_staging</span>. (While HDFS tools are expected to treat names beginning
 with either an underscore or a dot as hidden, in practice names beginning with an underscore are more widely
 supported.) If you have any scripts, cleanup jobs, and so on that rely on the name of this work directory,
 adjust them to use the new name.
 </p>

 <p class="p">
 <strong class="ph b">HBase considerations:</strong>
 </p>

 <p class="p">
 You can use the <code class="ph codeph">INSERT</code> statement with HBase tables as follows:
 </p>

 <ul class="ul">
 <li class="li">
 <p class="p">
 You can insert a single row or a small set of rows into an HBase table with the <code class="ph codeph">INSERT ...
 VALUES</code> syntax. This is a good use case for HBase tables with Impala, because HBase tables are
 not subject to the same kind of fragmentation from many small insert operations as HDFS tables are.
 </p>
 </li>

 <li class="li">
 <p class="p">
 You can insert any number of rows at once into an HBase table using the <code class="ph codeph">INSERT ...
 SELECT</code> syntax.
 </p>
 </li>

 <li class="li">
 <p class="p">
 If more than one inserted row has the same value for the HBase key column, only the last inserted row
 with that value is visible to Impala queries. You can take advantage of this fact with <code class="ph codeph">INSERT
 ... VALUES</code> statements to effectively update rows one at a time, by inserting new rows with the
 same key values as existing rows. Be aware that after an <code class="ph codeph">INSERT ... SELECT</code> operation
 copying from an HDFS table, the HBase table might contain fewer rows than were inserted, if the key
 column in the source table contained duplicate values.
 </p>
 </li>

 <li class="li">
 <p class="p">
 You cannot <code class="ph codeph">INSERT OVERWRITE</code> into an HBase table. New rows are always appended.
 </p>
 </li>

 <li class="li">
 <p class="p">
 When you create an Impala or Hive table that maps to an HBase table, the column order you specify with
 the <code class="ph codeph">INSERT</code> statement might be different than the order you declare with the
 <code class="ph codeph">CREATE TABLE</code> statement. Behind the scenes, HBase arranges the columns based on how
 they are divided into column families.
This might cause a mismatch during insert operations, especially + if you use the syntax <code class="ph codeph">INSERT INTO <var class="keyword varname">hbase_table</var> SELECT * FROM + <var class="keyword varname">hdfs_table</var></code>. Before inserting data, verify the column order by issuing a + <code class="ph codeph">DESCRIBE</code> statement for the table, and adjust the order of the select list in the + <code class="ph codeph">INSERT</code> statement. + </p> + </li> + </ul> + + <p class="p"> + See <a class="xref" href="impala_hbase.html#impala_hbase">Using Impala to Query HBase Tables</a> for more details about using Impala with HBase. + </p> + + <p class="p"> + <strong class="ph b">Amazon S3 considerations:</strong> + </p> + <p class="p"> + In <span class="keyword">Impala 2.6</span> and higher, the Impala DML statements (<code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, + and <code class="ph codeph">CREATE TABLE AS SELECT</code>) can write data into a table or partition that resides in the + Amazon Simple Storage Service (S3). + The syntax of the DML statements is the same as for any other tables, because the S3 location for tables and + partitions is specified by an <code class="ph codeph">s3a://</code> prefix in the + <code class="ph codeph">LOCATION</code> attribute of + <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statements. + If you bring data into S3 using the normal S3 transfer mechanisms instead of Impala DML statements, + issue a <code class="ph codeph">REFRESH</code> statement for the table before using Impala to query the S3 data. + </p> + <p class="p"> + Because of differences between S3 and traditional filesystems, DML operations + for S3 tables can take longer than for tables on HDFS. For example, both the + <code class="ph codeph">LOAD DATA</code> statement and the final stage of the <code class="ph codeph">INSERT</code> + and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements involve moving files from one directory + to another. (In the case of <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code>, + the files are moved from a temporary staging directory to the final destination directory.) + Because S3 does not support a <span class="q">"rename"</span> operation for existing objects, in these cases Impala + actually copies the data files from one location to another and then removes the original files. + In <span class="keyword">Impala 2.6</span>, the <code class="ph codeph">S3_SKIP_INSERT_STAGING</code> query option provides a way + to speed up <code class="ph codeph">INSERT</code> statements for S3 tables and partitions, with the tradeoff + that a problem during statement execution could leave data in an inconsistent state. + It does not apply to <code class="ph codeph">INSERT OVERWRITE</code> or <code class="ph codeph">LOAD DATA</code> statements. + See <a class="xref" href="../shared/../topics/impala_s3_skip_insert_staging.html#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only)</a> for details. 
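+ </p>
+
+ <p class="p">
+ A hedged sketch of DML against an S3-backed table (the bucket name, path, and table names are
+ hypothetical):
+ </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE sales_s3 (id INT, amount DOUBLE)
+  LOCATION 's3a://example-bucket/sales/';
+
+-- Optionally trade crash safety for speed on S3 INSERT statements.
+SET S3_SKIP_INSERT_STAGING=true;
+
+INSERT INTO sales_s3 SELECT id, amount FROM sales_staging;
+</code></pre>
+
+ <p class="p">
+ If other tools later add files under that S3 location, run a <code class="ph codeph">REFRESH</code> statement
+ for the table before querying it through Impala.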
+ </p>
 <p class="p">See <a class="xref" href="impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a> for details about reading and writing S3 data with Impala.</p>

 <p class="p">
 <strong class="ph b">Security considerations:</strong>
 </p>
 <p class="p">
 If these statements in your environment contain sensitive literal values such as credit card numbers or tax
 identifiers, Impala can redact this sensitive information when displaying the statements in log files and
 other administrative contexts. See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details.
 </p>

 <p class="p">
 <strong class="ph b">Cancellation:</strong> Can be cancelled. To cancel this statement, use Ctrl-C from the
 <span class="keyword cmdname">impala-shell</span> interpreter, the <span class="ph uicontrol">Cancel</span> button from the
 <span class="ph uicontrol">Watch</span> page in Hue, or <span class="ph uicontrol">Cancel</span> from the list of
 in-flight queries (for a particular node) on the <span class="ph uicontrol">Queries</span> tab in the Impala web UI
 (port 25000).
 </p>

 <p class="p">
 <strong class="ph b">HDFS permissions:</strong>
 </p>
 <p class="p">
 The user ID that the <span class="keyword cmdname">impalad</span> daemon runs under,
 typically the <code class="ph codeph">impala</code> user, must have read
 permission for the files in the source directory of an <code class="ph codeph">INSERT ... SELECT</code>
 operation, and write permission for all affected directories in the destination table.
 (An <code class="ph codeph">INSERT</code> operation could write files to multiple different HDFS directories
 if the destination table is partitioned.)
 This user must also have write permission to create a temporary work directory
 in the top-level HDFS directory of the destination table.
 An <code class="ph codeph">INSERT OVERWRITE</code> operation does not require write permission on
 the original data files in the table, only on the table directories themselves.
 </p>

 <p class="p">
 <strong class="ph b">Restrictions:</strong>
 </p>

 <p class="p">
 For <code class="ph codeph">INSERT</code> operations into <code class="ph codeph">CHAR</code> or <code class="ph codeph">VARCHAR</code> columns, you
 must cast all <code class="ph codeph">STRING</code> literals or expressions returning <code class="ph codeph">STRING</code> to a
 <code class="ph codeph">CHAR</code> or <code class="ph codeph">VARCHAR</code> type with the appropriate length.
 </p>

 <p class="p">
 <strong class="ph b">Related startup options:</strong>
 </p>

 <p class="p">
 By default, if an <code class="ph codeph">INSERT</code> statement creates any new subdirectories underneath a partitioned
 table, those subdirectories are assigned default HDFS permissions for the <code class="ph codeph">impala</code> user. To
 make each subdirectory have the same permissions as its parent directory in HDFS, specify the
 <code class="ph codeph">--insert_inherit_permissions</code> startup option for the <span class="keyword cmdname">impalad</span> daemon.
 </p>
 </div>
 </article>


</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_install.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_install.html b/docs/build/html/topics/impala_install.html
new file mode 100644
index 0000000..561d3b4
--- /dev/null
+++ b/docs/build/html/topics/impala_install.html
@@ -0,0 +1,126 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="install"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Installing Impala</title></head><body id="install"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1"><span class="ph">Installing Impala</span></h1>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Impala is an open-source analytic database for Apache Hadoop
+ that returns rapid responses to queries.
+ </p>
+
+ <p class="p">
+ Follow these steps to set up Impala on a cluster by building from source:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Download the latest release. See
+ <a class="xref" href="http://impala.apache.org/downloads.html" target="_blank">the Impala downloads page</a>
+ for the link to the latest release.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Check the <span class="ph filepath">README.md</span> file for a pointer
+ to the build instructions.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Verify the MD5 and SHA1 checksums of the download, and verify the GPG signature
+ using the code signing keys of the release managers.
+ </p>
+ </li>
+ <li class="li">
+ <div class="p">
+ Developers interested in working on Impala can clone the Impala source repository:
+<pre class="pre codeblock"><code>
+git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git
+</code></pre>
+ </div>
+ </li>
+ </ul>
+
+ </div>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="install__install_details">
+
+ <h2 class="title topictitle2" id="ariaid-title2">What is Included in an Impala Installation</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Impala is made up of a set of components that can be installed on multiple nodes throughout your cluster.
+ The key installation step for performance is to install the <span class="keyword cmdname">impalad</span> daemon (which does
+ most of the query processing work) on <em class="ph i">all</em> DataNodes in the cluster.
+ </p>
+
+ <p class="p">
+ The Impala package installs these binaries:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ <span class="keyword cmdname">impalad</span> - The Impala daemon. Plans and executes queries against HDFS, HBase, <span class="ph">and Amazon S3 data</span>.
+ <a class="xref" href="impala_processes.html#processes">Run one impalad process</a> on each node in the cluster
+ that has a DataNode.
+ </p> + </li> + + <li class="li"> + <p class="p"> + <span class="keyword cmdname">statestored</span> - Name service that tracks location and status of all + <code class="ph codeph">impalad</code> instances in the cluster. <a class="xref" href="impala_processes.html#processes">Run one + instance of this daemon</a> on a node in your cluster. Most production deployments run this daemon + on the namenode. + </p> + </li> + + <li class="li"> + <p class="p"> + <span class="keyword cmdname">catalogd</span> - Metadata coordination service that broadcasts changes from Impala DDL and + DML statements to all affected Impala nodes, so that new tables, newly loaded data, and so on are + immediately visible to queries submitted through any Impala node. + + (Prior to Impala 1.2, you had to run the <code class="ph codeph">REFRESH</code> or <code class="ph codeph">INVALIDATE + METADATA</code> statement on each node to synchronize changed metadata. Now those statements are only + required if you perform the DDL or DML through an external mechanism such as Hive <span class="ph">or by uploading + data to the Amazon S3 filesystem</span>.) + <a class="xref" href="impala_processes.html#processes">Run one instance of this daemon</a> on a node in your cluster, + preferably on the same host as the <code class="ph codeph">statestored</code> daemon. + </p> + </li> + + <li class="li"> + <p class="p"> + <span class="keyword cmdname">impala-shell</span> - <a class="xref" href="impala_impala_shell.html#impala_shell">Command-line + interface</a> for issuing queries to the Impala daemon. You install this on one or more hosts + anywhere on your network, not necessarily DataNodes or even within the same cluster as Impala. It can + connect remotely to any instance of the Impala daemon. + </p> + </li> + </ul> + + <p class="p"> + Before doing the installation, ensure that you have all necessary prerequisites. See + <a class="xref" href="impala_prereqs.html#prereqs">Impala Requirements</a> for details. 
+ </p>
+ </div>
+ </article>
+
+</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_int.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_int.html b/docs/build/html/topics/impala_int.html
new file mode 100644
index 0000000..2fcd403
--- /dev/null
+++ b/docs/build/html/topics/impala_int.html
@@ -0,0 +1,119 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_datatypes.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="int"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>INT Data Type</title></head><body id="int"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">INT Data Type</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ A 4-byte integer data type used in <code class="ph codeph">CREATE TABLE</code> and <code class="ph codeph">ALTER TABLE</code> statements.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Syntax:</strong>
+ </p>
+
+ <p class="p">
+ In the column definition of a <code class="ph codeph">CREATE TABLE</code> statement:
+ </p>
+
+<pre class="pre codeblock"><code><var class="keyword varname">column_name</var> INT</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Range:</strong> -2147483648 .. 2147483647. There is no <code class="ph codeph">UNSIGNED</code> subtype.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Conversions:</strong> Impala automatically converts to a larger integer type (<code class="ph codeph">BIGINT</code>) or a
+ floating-point type (<code class="ph codeph">FLOAT</code> or <code class="ph codeph">DOUBLE</code>). Use
+ <code class="ph codeph">CAST()</code> to convert to <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>,
+ <code class="ph codeph">STRING</code>, or <code class="ph codeph">TIMESTAMP</code>.
+ <span class="ph">Casting an integer or floating-point value <code class="ph codeph">N</code> to
+ <code class="ph codeph">TIMESTAMP</code> produces a value that is <code class="ph codeph">N</code> seconds past the start of the epoch
+ date (January 1, 1970). By default, the result value represents a date and time in the UTC time zone.
+ If the setting <code class="ph codeph">-use_local_tz_for_unix_timestamp_conversions=true</code> is in effect,
+ the resulting <code class="ph codeph">TIMESTAMP</code> represents a date and time in the local time zone.</span>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <p class="p">
+ The data type <code class="ph codeph">INTEGER</code> is an alias for <code class="ph codeph">INT</code>.
+ </p>
+
+ <p class="p">
+ For a convenient and automated way to check the bounds of the <code class="ph codeph">INT</code> type, call the functions
+ <code class="ph codeph">MIN_INT()</code> and <code class="ph codeph">MAX_INT()</code>.
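+ </p>
+
+ <p class="p">
+ For example:
+ </p>
+
+<pre class="pre codeblock"><code>SELECT MIN_INT(), MAX_INT();  -- Returns -2147483648 and 2147483647.
+</code></pre>
+
+ <p class="p">
+ Comparing a computed value against these bounds is a quick way to check whether it fits in an
+ <code class="ph codeph">INT</code> column.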
+ </p>
+
+ <p class="p">
+ If an integer value is too large to be represented as an <code class="ph codeph">INT</code>, use a <code class="ph codeph">BIGINT</code>
+ instead.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">NULL considerations:</strong> Casting any non-numeric value to this type produces a <code class="ph codeph">NULL</code>
+ value.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE t1 (x INT);
+SELECT CAST(1000 AS INT);
+</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Partitioning:</strong> Prefer to use this type for a partition key column. Impala can process the numeric
+ type more efficiently than a <code class="ph codeph">STRING</code> representation of the value.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">HBase considerations:</strong> This data type is fully compatible with HBase tables.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Parquet considerations:</strong>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Text table considerations:</strong> Values of this type are potentially larger in text tables than in tables
+ using Parquet or other binary formats.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Internal details:</strong> Represented in memory as a 4-byte value.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Added in:</strong> Available in all versions of Impala.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Column statistics considerations:</strong> Because this type has a fixed size, the maximum and average size
+ fields are always filled in for column statistics, even before you run the <code class="ph codeph">COMPUTE STATS</code>
+ statement.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+
+ <p class="p">
+ <a class="xref" href="impala_literals.html#numeric_literals">Numeric Literals</a>, <a class="xref" href="impala_tinyint.html#tinyint">TINYINT Data Type</a>,
+ <a class="xref" href="impala_smallint.html#smallint">SMALLINT Data Type</a>, <a class="xref" href="impala_int.html#int">INT Data Type</a>,
+ <a class="xref" href="impala_bigint.html#bigint">BIGINT Data Type</a>, <a class="xref" href="impala_decimal.html#decimal">DECIMAL Data Type (Impala 1.4 or higher only)</a>,
+ <a class="xref" href="impala_math_functions.html#math_functions">Impala Mathematical Functions</a>
+ </p>
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_datatypes.html">Data Types</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_intro.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_intro.html b/docs/build/html/topics/impala_intro.html
new file mode 100644
index 0000000..cdf05b7
--- /dev/null
+++ b/docs/build/html/topics/impala_intro.html
@@ -0,0 +1,198 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="intro"><link
rel="stylesheet" type="text/css" href="../commonltr.css"><title>Introducing Apache Impala (incubating)</title></head><body id="intro"><main role="main"><article role="article" aria-labelledby="intro__impala"> + + <h1 class="title topictitle1" id="intro__impala"><span class="ph">Introducing Apache Impala (incubating)</span></h1> + + + <div class="body conbody" id="intro__intro_body"> + + <p class="p"> + Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, + HBase, <span class="ph">or the Amazon Simple Storage Service (S3)</span>. + In addition to using the same unified storage platform, + Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface + (Impala query UI in Hue) as Apache Hive. This + provides a familiar and unified platform for real-time or batch-oriented queries. + </p> + + <p class="p"> + Impala is an addition to tools available for querying big data. Impala does not replace the batch + processing frameworks built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are + best suited for long running batch jobs, such as those involving batch processing of Extract, Transform, + and Load (ETL) type jobs. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Impala was accepted into the Apache incubator on December 2, 2015. + In places where the documentation formerly referred to <span class="q">"Cloudera Impala"</span>, + now the official name is <span class="q">"Apache Impala (incubating)"</span>. + </div> + + </div> + + <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="intro__benefits"> + + <h2 class="title topictitle2" id="ariaid-title2">Impala Benefits</h2> + + <div class="body conbody"> + + <div class="p"> + Impala provides: + + <ul class="ul"> + <li class="li"> + Familiar SQL interface that data scientists and analysts already know. + </li> + + <li class="li"> + Ability to query high volumes of data (<span class="q">"big data"</span>) in Apache Hadoop. + </li> + + <li class="li"> + Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective + commodity hardware. + </li> + + <li class="li"> + Ability to share data files between different components with no copy or export/import step; for example, + to write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive + tables, enabling simple data interchange using Impala for analytics on Hive-produced data. + </li> + + <li class="li"> + Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just + for analytics. + </li> + </ul> + </div> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="intro__impala_hadoop"> + + <h2 class="title topictitle2" id="ariaid-title3">How Impala Works with <span class="keyword">Apache Hadoop</span></h2> + + + <div class="body conbody"> + + + + <div class="p"> + The Impala solution is composed of the following components: + <ul class="ul"> + <li class="li"> + Clients - Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact + with Impala. These interfaces are typically used to issue queries or complete administrative tasks such + as connecting to Impala. + </li> + + <li class="li"> + Hive Metastore - Stores information about the data available to Impala. 
For example, the metastore lets
+ Impala know what databases are available and what the structure of those databases is. As you create,
+ drop, and alter schema objects, load data into tables, and so on through Impala SQL statements, the
+ relevant metadata changes are automatically broadcast to all Impala nodes by the dedicated catalog
+ service introduced in Impala 1.2.
+ </li>
+
+ <li class="li">
+ Impala - This process, which runs on DataNodes, coordinates and executes queries. Each
+ instance of Impala can receive, plan, and coordinate queries from Impala clients. Queries are
+ distributed among Impala nodes, and these nodes then act as workers, executing parallel query
+ fragments.
+ </li>
+
+ <li class="li">
+ HBase and HDFS - Storage for data to be queried.
+ </li>
+ </ul>
+ </div>
+
+ <div class="p">
+ Queries executed using Impala are handled as follows:
+ <ol class="ol">
+ <li class="li">
+ User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying
+ interfaces. The user application may connect to any <code class="ph codeph">impalad</code> in the cluster. This
+ <code class="ph codeph">impalad</code> becomes the coordinator for the query.
+ </li>
+
+ <li class="li">
+ Impala parses the query and analyzes it to determine what tasks need to be performed by
+ <code class="ph codeph">impalad</code> instances across the cluster. Execution is planned for optimal efficiency.
+ </li>
+
+ <li class="li">
+ Services such as HDFS and HBase are accessed by local <code class="ph codeph">impalad</code> instances to provide
+ data.
+ </li>
+
+ <li class="li">
+ Each <code class="ph codeph">impalad</code> returns data to the coordinating <code class="ph codeph">impalad</code>, which sends
+ these results to the client.
+ </li>
+ </ol>
+ </div>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="intro__features">
+
+ <h2 class="title topictitle2" id="ariaid-title4">Primary Impala Features</h2>
+
+ <div class="body conbody">
+
+ <div class="p">
+ Impala provides support for:
+ <ul class="ul">
+ <li class="li">
+ Most common SQL-92 features of Hive Query Language (HiveQL) including
+ <a class="xref" href="../shared/../topics/impala_select.html#select">SELECT</a>,
+ <a class="xref" href="../shared/../topics/impala_joins.html#joins">joins</a>, and
+ <a class="xref" href="../shared/../topics/impala_aggregate_functions.html#aggregate_functions">aggregate functions</a>.
+ </li>
+
+ <li class="li">
+ HDFS, HBase, <span class="ph">and Amazon Simple Storage Service (S3)</span> storage, including:
+ <ul class="ul">
+ <li class="li">
+ <a class="xref" href="../shared/../topics/impala_file_formats.html#file_formats">HDFS file formats</a>: delimited text files, Parquet,
+ Avro, SequenceFile, and RCFile.
+ </li>
+
+ <li class="li">
+ Compression codecs: Snappy, GZIP, Deflate, BZIP.
+ </li>
+ </ul>
+ </li>
+
+ <li class="li">
+ Common data access interfaces including:
+ <ul class="ul">
+ <li class="li">
+ <a class="xref" href="../shared/../topics/impala_jdbc.html#impala_jdbc">JDBC driver</a>.
+ </li>
+
+ <li class="li">
+ <a class="xref" href="../shared/../topics/impala_odbc.html#impala_odbc">ODBC driver</a>.
+ </li>
+
+ <li class="li">
+ Hue Beeswax and the Impala Query UI.
+ </li>
+ </ul>
+ </li>
+
+ <li class="li">
+ <a class="xref" href="../shared/../topics/impala_impala_shell.html#impala_shell">impala-shell command-line interface</a>.
+ </li>
+
+ <li class="li">
+ <a class="xref" href="../shared/../topics/impala_security.html#security">Kerberos authentication</a>.
+ </li>
+ </ul>
+ </div>
+ </div>
+ </article>
+</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_invalidate_metadata.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_invalidate_metadata.html b/docs/build/html/topics/impala_invalidate_metadata.html
new file mode 100644
index 0000000..c06b9c9
--- /dev/null
+++ b/docs/build/html/topics/impala_invalidate_metadata.html
@@ -0,0 +1,294 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref_sql.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="invalidate_metadata"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>INVALIDATE METADATA Statement</title></head><body id="invalidate_metadata"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">INVALIDATE METADATA Statement</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Marks the metadata for one or all tables as stale. This statement is required after a table is created
+ through the Hive shell, before the table is available for Impala queries. The next time the current Impala
+ node performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata
+ before the query proceeds. This is a relatively expensive operation compared to the incremental metadata
+ update done by the <code class="ph codeph">REFRESH</code> statement, so in the common scenario of adding new data
+ files to an existing table, prefer <code class="ph codeph">REFRESH</code> rather than <code class="ph codeph">INVALIDATE METADATA</code>.
+ If you are not familiar with the way Impala uses metadata and how it shares the same metastore database as Hive, see
+ <a class="xref" href="impala_hadoop.html#intro_metastore">Overview of Impala Metadata and the Metastore</a> for background information.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Syntax:</strong>
+ </p>
+
+<pre class="pre codeblock"><code>INVALIDATE METADATA [[<var class="keyword varname">db_name</var>.]<var class="keyword varname">table_name</var>]</code></pre>
+
+ <p class="p">
+ By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for
+ that one table is flushed. Even for a single table, <code class="ph codeph">INVALIDATE METADATA</code> is more expensive
+ than <code class="ph codeph">REFRESH</code>, so prefer <code class="ph codeph">REFRESH</code> in the common case where you add new data
+ files for an existing table.
+ </p>
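+
+ <p class="p">
+ As a minimal sketch of the two forms (the database and table names here are hypothetical):
+ </p>
+
+<pre class="pre codeblock"><code>-- Flush cached metadata for every table:
+INVALIDATE METADATA;
+
+-- Flush cached metadata for one table only:
+INVALIDATE METADATA sales_db.daily_totals;
+</code></pre>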
+
+ <p class="p">
+ <strong class="ph b">Internal details:</strong>
+ </p>
+
+ <p class="p">
+ To accurately respond to queries, Impala must have current metadata about those databases and tables that
+ clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore
+ that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean
+ that all metadata updates require an Impala update.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <p class="p">
+ In Impala 1.2.4 and higher, you can specify a table name with <code class="ph codeph">INVALIDATE METADATA</code> after
+ the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full
+ reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast
+ mechanism faster and more responsive, especially during Impala startup. See
+ <a class="xref" href="../shared/../topics/impala_new_features.html#new_features_124">New Features in Impala 1.2.4</a> for details.
+ </p>
+ <p class="p">
+ In Impala 1.2 and higher, a dedicated daemon (<span class="keyword cmdname">catalogd</span>) broadcasts DDL changes made
+ through Impala to all Impala nodes. Formerly, after you created a database or table while connected to one
+ Impala node, you needed to issue an <code class="ph codeph">INVALIDATE METADATA</code> statement on another Impala node
+ before accessing the new database or table from the other node. Now, newly created or altered objects are
+ picked up automatically by all Impala nodes. You must still use the <code class="ph codeph">INVALIDATE METADATA</code>
+ technique after creating or altering objects through Hive. See
+ <a class="xref" href="impala_components.html#intro_catalogd">The Impala Catalog Service</a> for more information on the catalog service.
+ </p>
+ <p class="p">
+ The <code class="ph codeph">INVALIDATE METADATA</code> statement is new in Impala 1.1 and higher, and takes over some of
+ the use cases of the Impala 1.0 <code class="ph codeph">REFRESH</code> statement. Because <code class="ph codeph">REFRESH</code> now
+ requires a table name parameter, to flush the metadata for all tables at once, use the <code class="ph codeph">INVALIDATE
+ METADATA</code> statement.
+ </p>
+ <p class="p">
+ Because <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> only works for tables that the current
+ Impala node is already aware of, when you create a new table in the Hive shell, you must enter
+ <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">new_table</var></code> before you can see the new table in
+ <span class="keyword cmdname">impala-shell</span>. Once the table is known to Impala, you can issue <code class="ph codeph">REFRESH
+ <var class="keyword varname">table_name</var></code> after you add data files for that table, as the sketch after this note illustrates.
+ </p>
+ </div>
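+
+ <p class="p">
+ The following sketch shows that sequence, assuming a hypothetical table named
+ <code class="ph codeph">new_table</code> that was just created in the Hive shell:
+ </p>
+
+<pre class="pre codeblock"><code>-- Make the Hive-created table visible to Impala:
+INVALIDATE METADATA new_table;
+
+-- Later, after adding data files for the now-known table through Hive:
+REFRESH new_table;
+</code></pre>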
+
+ <p class="p">
+ <code class="ph codeph">INVALIDATE METADATA</code> and <code class="ph codeph">REFRESH</code> are counterparts: <code class="ph codeph">INVALIDATE
+ METADATA</code> waits to reload the metadata when needed for a subsequent query, but reloads all the
+ metadata for the table, which can be an expensive operation, especially for large tables with many
+ partitions. <code class="ph codeph">REFRESH</code> reloads the metadata immediately, but only loads the block location
+ data for newly added data files, making it a less expensive operation overall. If data was altered in some
+ more extensive way, such as being reorganized by the HDFS balancer, use <code class="ph codeph">INVALIDATE
+ METADATA</code> to avoid a performance penalty from reduced local reads. If you used Impala version 1.0,
+ the <code class="ph codeph">INVALIDATE METADATA</code> statement works just like the Impala 1.0 <code class="ph codeph">REFRESH</code>
+ statement did. The Impala 1.1 <code class="ph codeph">REFRESH</code> is optimized for the common use case of adding
+ new data files to an existing table; for that reason, the table name argument is now required.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <p class="p">
+ A metadata update for an <code class="ph codeph">impalad</code> instance <strong class="ph b">is</strong> required if:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ A metadata change occurs.
+ </li>
+
+ <li class="li">
+ <strong class="ph b">and</strong> the change is made from another <code class="ph codeph">impalad</code> instance in your cluster, or through
+ Hive.
+ </li>
+
+ <li class="li">
+ <strong class="ph b">and</strong> the change is made to a metastore database to which clients such as the Impala shell or ODBC directly
+ connect.
+ </li>
+ </ul>
+
+ <p class="p">
+ A metadata update for an Impala node is <strong class="ph b">not</strong> required when you issue queries from the same Impala node
+ where you ran <code class="ph codeph">ALTER TABLE</code>, <code class="ph codeph">INSERT</code>, or another table-modifying statement.
+ </p>
+
+ <p class="p">
+ Database and table metadata is typically modified by:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Hive - via <code class="ph codeph">ALTER</code>, <code class="ph codeph">CREATE</code>, <code class="ph codeph">DROP</code>, or
+ <code class="ph codeph">INSERT</code> operations.
+ </li>
+
+ <li class="li">
+ Impalad - via <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">ALTER TABLE</code>, and <code class="ph codeph">INSERT</code>
+ operations.
+ </li>
+ </ul>
+
+ <p class="p">
+ <code class="ph codeph">INVALIDATE METADATA</code> causes the metadata for that table to be marked as stale, and reloaded
+ the next time the table is referenced. For a huge table, that process could take a noticeable amount of time;
+ thus you might prefer to use <code class="ph codeph">REFRESH</code> where practical, to avoid an unpredictable delay later,
+ for example if the next reference to the table is during a benchmark test.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+ <p class="p">
+ The following example shows how you might use the <code class="ph codeph">INVALIDATE METADATA</code> statement after
+ creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the
+ <code class="ph codeph">INVALIDATE METADATA</code> statement was issued, Impala would give a <span class="q">"table not found"</span> error
+ if you tried to refer to those table names. The <code class="ph codeph">DESCRIBE</code> statements cause the latest
+ metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried.
+ </p>
+
+<pre class="pre codeblock"><code>[impalad-host:21000] > invalidate metadata;
+[impalad-host:21000] > describe t1;
+...
+[impalad-host:21000] > describe t2;
+...
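+-- The DESCRIBE statements above trigger the metadata reload immediately,
+-- so later queries against t1 and t2 do not pay the reload delay.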
+</code></pre>
+
+ <p class="p">
+ For more examples of using <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> with a
+ combination of Impala and Hive operations, see <a class="xref" href="impala_tutorial.html#tutorial_impala_hive">Switching Back and Forth Between Impala and Hive</a>.
+ </p>
+
+ <p class="p">
+ If you need to ensure that the metadata is up-to-date when you start an <span class="keyword cmdname">impala-shell</span>
+ session, run <span class="keyword cmdname">impala-shell</span> with the <code class="ph codeph">-r</code> or
+ <code class="ph codeph">--refresh_after_connect</code> command-line option. Because this operation adds a delay to the next
+ query against each table, which can be expensive for large tables with many partitions, try to avoid using
+ this option for day-to-day operations in a production environment.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">HDFS permissions:</strong>
+ </p>
+ <p class="p">
+ The user ID that the <span class="keyword cmdname">impalad</span> daemon runs under,
+ typically the <code class="ph codeph">impala</code> user, must have execute
+ permissions for all the relevant directories holding table data.
+ (A table could have data spread across multiple directories,
+ or in unexpected paths, if it uses partitioning or
+ specifies a <code class="ph codeph">LOCATION</code> attribute for
+ individual partitions or the entire table.)
+ Issues with permissions might not cause an immediate error for this statement,
+ but subsequent statements such as <code class="ph codeph">SELECT</code>
+ or <code class="ph codeph">SHOW TABLE STATS</code> could fail.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">HDFS considerations:</strong>
+ </p>
+
+ <p class="p">
+ By default, the <code class="ph codeph">INVALIDATE METADATA</code> command checks HDFS permissions of the underlying data
+ files and directories, caching this information so that a statement can be cancelled immediately if, for
+ example, the <code class="ph codeph">impala</code> user does not have permission to write to the data directory for the
+ table. (This checking does not apply if you have set the <span class="keyword cmdname">catalogd</span> configuration option
+ <code class="ph codeph">--load_catalog_in_background=false</code>.) Impala reports any lack of write permissions as an
+ <code class="ph codeph">INFO</code> message in the log file, in case that represents an oversight. If you change HDFS
+ permissions to make data readable or writeable by the Impala user, issue another <code class="ph codeph">INVALIDATE
+ METADATA</code> to make Impala aware of the change.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+ <p class="p">
+ This example illustrates creating a new database and new table in Hive, then doing an <code class="ph codeph">INVALIDATE
+ METADATA</code> statement in Impala using the fully qualified table name, after which both the new table
+ and the new database are visible to Impala. The ability to specify <code class="ph codeph">INVALIDATE METADATA
+ <var class="keyword varname">table_name</var></code> for a table created in Hive is a new capability in Impala 1.2.4. In
+ earlier releases, that statement would have returned an error indicating an unknown table, requiring you to
+ do <code class="ph codeph">INVALIDATE METADATA</code> with no table name, a more expensive operation that reloaded metadata
+ for all tables and databases.
+ </p>
+
+<pre class="pre codeblock"><code>$ hive
+hive> create database new_db_from_hive;
+OK
+Time taken: 4.118 seconds
+hive> create table new_db_from_hive.new_table_from_hive (x int);
+OK
+Time taken: 0.618 seconds
+hive> quit;
+$ impala-shell
+[localhost:21000] > show databases like 'new*';
+[localhost:21000] > refresh new_db_from_hive.new_table_from_hive;
+ERROR: AnalysisException: Database does not exist: new_db_from_hive
+[localhost:21000] > invalidate metadata new_db_from_hive.new_table_from_hive;
+[localhost:21000] > show databases like 'new*';
++------------------+
+| name             |
++------------------+
+| new_db_from_hive |
++------------------+
+[localhost:21000] > show tables in new_db_from_hive;
++---------------------+
+| name                |
++---------------------+
+| new_table_from_hive |
++---------------------+</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Amazon S3 considerations:</strong>
+ </p>
+ <p class="p">
+ The <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> statements also cache metadata
+ for tables where the data resides in the Amazon Simple Storage Service (S3).
+ In particular, issue a <code class="ph codeph">REFRESH</code> for a table after adding or removing files
+ in the associated S3 data directory.
+ See <a class="xref" href="../shared/../topics/impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a> for details about working with S3 tables.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Cancellation:</strong> Cannot be cancelled.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Kudu considerations:</strong>
+ </p>
+ <p class="p">
+ Much of the metadata for Kudu tables is handled by the underlying
+ storage layer. Kudu tables have less reliance on the metastore
+ database, and require less metadata caching on the Impala side.
+ For example, information about partitions in Kudu tables is managed
+ by Kudu, and Impala does not cache any block locality metadata
+ for Kudu tables.
+ </p>
+ <p class="p">
+ The <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code>
+ statements are needed less frequently for Kudu tables than for
+ HDFS-backed tables. Neither statement is needed when data is
+ added to, removed from, or updated in a Kudu table, even if the changes
+ are made directly to Kudu through a client program using the Kudu API.
+ Run <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> or
+ <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code>
+ for a Kudu table only after making a change to the Kudu table schema,
+ such as adding or dropping a column, by a mechanism other than
+ Impala.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+ <p class="p">
+ <a class="xref" href="impala_hadoop.html#intro_metastore">Overview of Impala Metadata and the Metastore</a>,
+ <a class="xref" href="impala_refresh.html#refresh">REFRESH Statement</a>
+ </p>
+
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref_sql.html">Impala SQL Statements</a></div></div></nav></article></main></body></html>
\ No newline at end of file