http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_describe.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_describe.xml b/docs/topics/impala_describe.xml new file mode 100644 index 0000000..c4070c3 --- /dev/null +++ b/docs/topics/impala_describe.xml @@ -0,0 +1,689 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="describe"> + + <title id="desc">DESCRIBE Statement</title> + <titlealts audience="PDF"><navtitle>DESCRIBE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Reports"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">DESCRIBE statement</indexterm> + The <codeph>DESCRIBE</codeph> statement displays metadata about a table, such as the column names and their + data types. + <ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, you can specify the name of a complex type column, which takes + the form of a dotted path. The path might include multiple components in the case of a nested type definition.</ph> + <ph rev="2.5.0">In <keyword keyref="impala25_full"/> and higher, the <codeph>DESCRIBE DATABASE</codeph> form can display + information about a database.</ph> + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock rev="2.5.0">DESCRIBE [DATABASE] [FORMATTED|EXTENDED] <varname>object_name</varname> + +object_name ::= + [<varname>db_name</varname>.]<varname>table_name</varname>[.<varname>complex_col_name</varname> ...] + | <varname>db_name</varname> +</codeblock> + + <p> + You can use the abbreviation <codeph>DESC</codeph> for the <codeph>DESCRIBE</codeph> statement. + </p> + + <p rev="1.1"> + The <codeph>DESCRIBE FORMATTED</codeph> variation displays additional information, in a format familiar to + users of Apache Hive. The extra information includes low-level details such as whether the table is internal + or external, when it was created, the file format, the location of the data in HDFS, whether the object is a + table or a view, and (for views) the text of the query from the view definition. + </p> + + <note> + The <codeph>Compressed</codeph> field is not a reliable indicator of whether the table contains compressed + data. It typically always shows <codeph>No</codeph>, because the compression settings only apply during the + session that loads data and are not stored persistently with the table metadata. + </note> + +<p rev="2.5.0 IMPALA-2196"> + <b>Describing databases:</b> +</p> + +<p rev="2.5.0"> + By default, the <codeph>DESCRIBE</codeph> output for a database includes the location + and the comment, which can be set by the <codeph>LOCATION</codeph> and <codeph>COMMENT</codeph> + clauses on the <codeph>CREATE DATABASE</codeph> statement. +</p> + +<p rev="2.5.0"> + The additional information displayed by the <codeph>FORMATTED</codeph> or <codeph>EXTENDED</codeph> + keyword includes the HDFS user ID that is considered the owner of the database, and any + optional database properties. 
The properties could be specified by the <codeph>WITH DBPROPERTIES</codeph> + clause if the database is created using a Hive <codeph>CREATE DATABASE</codeph> statement. + Impala currently does not set or do any special processing based on those properties. +</p> + +<p rev="2.5.0"> +The following examples show the variations in syntax and output for +describing databases. This feature is available in <keyword keyref="impala25_full"/> +and higher. +</p> + +<codeblock rev="2.5.0"> +describe database default; ++---------+----------------------+-----------------------+ +| name | location | comment | ++---------+----------------------+-----------------------+ +| default | /user/hive/warehouse | Default Hive database | ++---------+----------------------+-----------------------+ + +describe database formatted default; ++---------+----------------------+-----------------------+ +| name | location | comment | ++---------+----------------------+-----------------------+ +| default | /user/hive/warehouse | Default Hive database | +| Owner: | | | +| | public | ROLE | ++---------+----------------------+-----------------------+ + +describe database extended default; ++---------+----------------------+-----------------------+ +| name | location | comment | ++---------+----------------------+-----------------------+ +| default | /user/hive/warehouse | Default Hive database | +| Owner: | | | +| | public | ROLE | ++---------+----------------------+-----------------------+ +</codeblock> + +<p> + <b>Describing tables:</b> +</p> + +<p> + If the <codeph>DATABASE</codeph> keyword is omitted, the default + for the <codeph>DESCRIBE</codeph> statement is to refer to a table. +</p> + +<codeblock> +-- By default, the table is assumed to be in the current database. +describe my_table; ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | int | | +| s | string | | ++------+--------+---------+ + +-- Use a fully qualified table name to specify a table in any database. +describe my_database.my_table; ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | int | | +| s | string | | ++------+--------+---------+ + +-- The formatted or extended output includes additional useful information. +-- The LOCATION field is especially useful to know for DDL statements and HDFS commands +-- during ETL jobs. (The LOCATION includes a full hdfs:// URL, omitted here for readability.) +describe formatted my_table; ++------------------------------+----------------------------------------------+----------------------+ +| name | type | comment | ++------------------------------+----------------------------------------------+----------------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | NULL | +| s | string | NULL | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | my_database | NULL | +| Owner: | jrussell | NULL | +| CreateTime: | Fri Mar 18 15:58:00 PDT 2016 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Location: | /user/hive/warehouse/my_database.db/my_table | NULL | +| Table Type: | MANAGED_TABLE | NULL | +| Table Parameters: | NULL | NULL | +| | transient_lastDdlTime | 1458341880 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | org. ... .LazySimpleSerDe | NULL | +| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL | +| OutputFormat: | org. ... 
.HiveIgnoreKeyTextOutputFormat | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | ++------------------------------+----------------------------------------------+----------------------+ +</codeblock> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p rev="2.3.0"> + Because the column definitions for complex types can become long, particularly when such types are nested, + the <codeph>DESCRIBE</codeph> statement uses special formatting for complex type columns to make the output readable. + </p> + + <p rev="2.3.0"> + For the <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types available in + <keyword keyref="impala23_full"/> and higher, the <codeph>DESCRIBE</codeph> output is formatted to avoid + excessively long lines for multiple fields within a <codeph>STRUCT</codeph>, or a nested sequence of + complex types. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_describe"/> + + <p rev="2.3.0"> + For example, here is the <codeph>DESCRIBE</codeph> output for a table containing a single top-level column + of each complex type: + </p> + +<codeblock rev="2.3.0"><![CDATA[create table t1 (x int, a array<int>, s struct<f1: string, f2: bigint>, m map<string,int>) stored as parquet; + +describe t1; ++------+-----------------+---------+ +| name | type | comment | ++------+-----------------+---------+ +| x | int | | +| a | array<int> | | +| s | struct< | | +| | f1:string, | | +| | f2:bigint | | +| | > | | +| m | map<string,int> | | ++------+-----------------+---------+ +]]> +</codeblock> + + <p rev="2.3.0"> + Here are examples showing how to <q>drill down</q> into the layouts of complex types, including + using multi-part names to examine the definitions of nested types. + The <codeph>< ></codeph> delimiters identify the columns with complex types; + these are the columns where you can descend another level to see the parts that make up + the complex type. + This technique helps you to understand the multi-part names you use as table references in queries + involving complex types, and the corresponding column names you refer to in the <codeph>SELECT</codeph> list. + These tables are from the <q>nested TPC-H</q> schema, shown in detail in + <xref href="impala_complex_types.xml#complex_sample_schema"/>. + </p> + + <p> + The <codeph>REGION</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> + elements: + </p> + + <ul> + <li> + <p> + The first <codeph>DESCRIBE</codeph> specifies the table name, to display the definition + of each top-level column. + </p> + </li> + <li> + <p> + The second <codeph>DESCRIBE</codeph> specifies the name of a complex + column, <codeph>REGION.R_NATIONS</codeph>, showing that when you include the name of an <codeph>ARRAY</codeph> + column in a <codeph>FROM</codeph> clause, that table reference acts like a two-column table with + columns <codeph>ITEM</codeph> and <codeph>POS</codeph>. + </p> + </li> + <li> + <p> + The final <codeph>DESCRIBE</codeph> specifies the fully qualified name of the <codeph>ITEM</codeph> field, + to display the layout of its underlying <codeph>STRUCT</codeph> type in table format, with the fields + mapped to column names. + </p> + </li> + </ul> + +<codeblock rev="2.3.0"><![CDATA[ +-- #1: The overall layout of the entire table. 
+describe region;
++-------------+-------------------------+---------+
+| name | type | comment |
++-------------+-------------------------+---------+
+| r_regionkey | smallint | |
+| r_name | string | |
+| r_comment | string | |
+| r_nations | array<struct< | |
+| | n_nationkey:smallint, | |
+| | n_name:string, | |
+| | n_comment:string | |
+| | >> | |
++-------------+-------------------------+---------+
+
+-- #2: The ARRAY column within the table.
+describe region.r_nations;
++------+-------------------------+---------+
+| name | type | comment |
++------+-------------------------+---------+
+| item | struct< | |
+| | n_nationkey:smallint, | |
+| | n_name:string, | |
+| | n_comment:string | |
+| | > | |
+| pos | bigint | |
++------+-------------------------+---------+
+
+-- #3: The STRUCT that makes up each ARRAY element.
+-- The fields of the STRUCT act like columns of a table.
+describe region.r_nations.item;
++-------------+----------+---------+
+| name | type | comment |
++-------------+----------+---------+
+| n_nationkey | smallint | |
+| n_name | string | |
+| n_comment | string | |
++-------------+----------+---------+
+]]>
+</codeblock>
+
+ <p>
+ The <codeph>CUSTOMER</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>
+ elements, where one field in the <codeph>STRUCT</codeph> is another <codeph>ARRAY</codeph> of
+ <codeph>STRUCT</codeph> elements:
+ </p>
+ <ul>
+ <li>
+ <p>
+ Again, the initial <codeph>DESCRIBE</codeph> specifies only the table name.
+ </p>
+ </li>
+ <li>
+ <p>
+ The second <codeph>DESCRIBE</codeph> specifies the qualified name of the complex
+ column, <codeph>CUSTOMER.C_ORDERS</codeph>, showing how an <codeph>ARRAY</codeph>
+ is represented as a two-column table with columns <codeph>ITEM</codeph> and <codeph>POS</codeph>.
+ </p>
+ </li>
+ <li>
+ <p>
+ The third <codeph>DESCRIBE</codeph> specifies the qualified name of the <codeph>ITEM</codeph>
+ of the <codeph>ARRAY</codeph> column, to see the structure of the nested <codeph>ARRAY</codeph>.
+ Again, it has two parts, <codeph>ITEM</codeph> and <codeph>POS</codeph>. Because the
+ <codeph>ARRAY</codeph> contains a <codeph>STRUCT</codeph>, the layout of the <codeph>STRUCT</codeph>
+ is shown.
+ </p>
+ </li>
+ <li>
+ <p>
+ The fourth and fifth <codeph>DESCRIBE</codeph> statements drill down into a <codeph>STRUCT</codeph> field that
+ is itself a complex type, an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>.
+ The <codeph>ITEM</codeph> portion of the qualified name is only required when the <codeph>ARRAY</codeph>
+ elements are anonymous. The fields of the <codeph>STRUCT</codeph> give names to any other complex types
+ nested inside the <codeph>STRUCT</codeph>. Therefore, the <codeph>DESCRIBE</codeph> parameters
+ <codeph>CUSTOMER.C_ORDERS.ITEM.O_LINEITEMS</codeph> and <codeph>CUSTOMER.C_ORDERS.O_LINEITEMS</codeph>
+ are equivalent. (For brevity, leave out the <codeph>ITEM</codeph> portion of
+ a qualified name when it is not required.)
+ </p>
+ </li>
+ <li>
+ <p>
+ The final <codeph>DESCRIBE</codeph> shows the layout of the deeply nested <codeph>STRUCT</codeph> type.
+ Because there are no more complex types nested inside this <codeph>STRUCT</codeph>, this is as far
+ as you can drill down into the layout for this table.
+ </p>
+ </li>
+ </ul>
+
+<codeblock rev="2.3.0"><![CDATA[-- #1: The overall layout of the entire table.
+describe customer; ++--------------+------------------------------------+ +| name | type | ++--------------+------------------------------------+ +| c_custkey | bigint | +... more scalar columns ... +| c_orders | array<struct< | +| | o_orderkey:bigint, | +| | o_orderstatus:string, | +| | o_totalprice:decimal(12,2), | +| | o_orderdate:string, | +| | o_orderpriority:string, | +| | o_clerk:string, | +| | o_shippriority:int, | +| | o_comment:string, | +| | o_lineitems:array<struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +| | l_linenumber:int, | +| | l_quantity:decimal(12,2), | +| | l_extendedprice:decimal(12,2), | +| | l_discount:decimal(12,2), | +| | l_tax:decimal(12,2), | +| | l_returnflag:string, | +| | l_linestatus:string, | +| | l_shipdate:string, | +| | l_commitdate:string, | +| | l_receiptdate:string, | +| | l_shipinstruct:string, | +| | l_shipmode:string, | +| | l_comment:string | +| | >> | +| | >> | ++--------------+------------------------------------+ + +-- #2: The ARRAY column within the table. +describe customer.c_orders; ++------+------------------------------------+ +| name | type | ++------+------------------------------------+ +| item | struct< | +| | o_orderkey:bigint, | +| | o_orderstatus:string, | +... more struct fields ... +| | o_lineitems:array<struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more nested struct fields ... +| | l_comment:string | +| | >> | +| | > | +| pos | bigint | ++------+------------------------------------+ + +-- #3: The STRUCT that makes up each ARRAY element. +-- The fields of the STRUCT act like columns of a table. +describe customer.c_orders.item; ++-----------------+----------------------------------+ +| name | type | ++-----------------+----------------------------------+ +| o_orderkey | bigint | +| o_orderstatus | string | +| o_totalprice | decimal(12,2) | +| o_orderdate | string | +| o_orderpriority | string | +| o_clerk | string | +| o_shippriority | int | +| o_comment | string | +| o_lineitems | array<struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more struct fields ... +| | l_comment:string | +| | >> | ++-----------------+----------------------------------+ + +-- #4: The ARRAY nested inside the STRUCT elements of the first ARRAY. +describe customer.c_orders.item.o_lineitems; ++------+----------------------------------+ +| name | type | ++------+----------------------------------+ +| item | struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more struct fields ... +| | l_comment:string | +| | > | +| pos | bigint | ++------+----------------------------------+ + +-- #5: Shorter form of the previous DESCRIBE. Omits the .ITEM portion of the name +-- because O_LINEITEMS and other field names provide a way to refer to things +-- inside the ARRAY element. +describe customer.c_orders.o_lineitems; ++------+----------------------------------+ +| name | type | ++------+----------------------------------+ +| item | struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more struct fields ... +| | l_comment:string | +| | > | +| pos | bigint | ++------+----------------------------------+ + +-- #6: The STRUCT representing ARRAY elements nested inside +-- another ARRAY of STRUCTs. The lack of any complex types +-- in this output means this is as far as DESCRIBE can +-- descend into the table layout. +describe customer.c_orders.o_lineitems.item; ++-----------------+---------------+ +| name | type | ++-----------------+---------------+ +| l_partkey | bigint | +| l_suppkey | bigint | +... 
more scalar columns ... +| l_comment | string | ++-----------------+---------------+ +]]> +</codeblock> + +<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + +<p> + After the <cmdname>impalad</cmdname> daemons are restarted, the first query against a table can take longer + than subsequent queries, because the metadata for the table is loaded before the query is processed. This + one-time delay for each table can cause misleading results in benchmark tests or cause unnecessary concern. + To <q>warm up</q> the Impala metadata cache, you can issue a <codeph>DESCRIBE</codeph> statement in advance + for each table you intend to access later. +</p> + +<p> + When you are dealing with data files stored in HDFS, sometimes it is important to know details such as the + path of the data files for an Impala table, and the hostname for the namenode. You can get this information + from the <codeph>DESCRIBE FORMATTED</codeph> output. You specify HDFS URIs or path specifications with + statements such as <codeph>LOAD DATA</codeph> and the <codeph>LOCATION</codeph> clause of <codeph>CREATE + TABLE</codeph> or <codeph>ALTER TABLE</codeph>. You might also use HDFS URIs or paths with Linux commands + such as <cmdname>hadoop</cmdname> and <cmdname>hdfs</cmdname> to copy, rename, and so on, data files in HDFS. +</p> + +<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> + +<p rev="1.2.1"> + Each table can also have associated table statistics and column statistics. To see these categories of + information, use the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and <codeph>SHOW COLUMN + STATS <varname>table_name</varname></codeph> statements. +<!-- +For example, the table statistics can often show you the number +and total size of the files in the table, even if you have not +run <codeph>COMPUTE STATS</codeph>. +--> + See <xref href="impala_show.xml#show"/> for details. +</p> + +<note conref="../shared/impala_common.xml#common/compute_stats_next"/> + +<p conref="../shared/impala_common.xml#common/example_blurb"/> + +<p> + The following example shows the results of both a standard <codeph>DESCRIBE</codeph> and <codeph>DESCRIBE + FORMATTED</codeph> for different kinds of schema objects: +</p> + + <ul> + <li> + <codeph>DESCRIBE</codeph> for a table or a view returns the name, type, and comment for each of the + columns. For a view, if the column value is computed by an expression, the column name is automatically + generated as <codeph>_c0</codeph>, <codeph>_c1</codeph>, and so on depending on the ordinal number of the + column. + </li> + + <li> + A table created with no special format or storage clauses is designated as a <codeph>MANAGED_TABLE</codeph> + (an <q>internal table</q> in Impala terminology). Its data files are stored in an HDFS directory under the + default Hive data directory. By default, it uses Text data format. + </li> + + <li> + A view is designated as <codeph>VIRTUAL_VIEW</codeph> in <codeph>DESCRIBE FORMATTED</codeph> output. Some + of its properties are <codeph>NULL</codeph> or blank because they are inherited from the base table. The + text of the query that defines the view is part of the <codeph>DESCRIBE FORMATTED</codeph> output. + </li> + + <li> + A table with additional clauses in the <codeph>CREATE TABLE</codeph> statement has differences in + <codeph>DESCRIBE FORMATTED</codeph> output. 
The output for <codeph>T2</codeph> includes the + <codeph>EXTERNAL_TABLE</codeph> keyword because of the <codeph>CREATE EXTERNAL TABLE</codeph> syntax, and + different <codeph>InputFormat</codeph> and <codeph>OutputFormat</codeph> fields to reflect the Parquet file + format. + </li> + </ul> + +<codeblock>[localhost:21000] > create table t1 (x int, y int, s string); +Query: create table t1 (x int, y int, s string) +[localhost:21000] > describe t1; +Query: describe t1 +Query finished, fetching results ... ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | int | | +| y | int | | +| s | string | | ++------+--------+---------+ +Returned 3 row(s) in 0.13s +[localhost:21000] > describe formatted t1; +Query: describe formatted t1 +Query finished, fetching results ... ++------------------------------+--------------------------------------------+------------+ +| name | type | comment | ++------------------------------+--------------------------------------------+------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | None | +| y | int | None | +| s | string | None | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | describe_formatted | NULL | +| Owner: | cloudera | NULL | +| CreateTime: | Mon Jul 22 17:03:16 EDT 2013 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Location: | hdfs://127.0.0.1:8020/user/hive/warehouse/ | | +| | describe_formatted.db/t1 | NULL | +| Table Type: | MANAGED_TABLE | NULL | +| Table Parameters: | NULL | NULL | +| | transient_lastDdlTime | 1374526996 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | org.apache.hadoop.hive.serde2.lazy. | | +| | LazySimpleSerDe | NULL | +| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL | +| OutputFormat: | org.apache.hadoop.hive.ql.io. | | +| | HiveIgnoreKeyTextOutputFormat | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | ++------------------------------+--------------------------------------------+------------+ +Returned 26 row(s) in 0.03s +[localhost:21000] > create view v1 as select x, upper(s) from t1; +Query: create view v1 as select x, upper(s) from t1 +[localhost:21000] > describe v1; +Query: describe v1 +Query finished, fetching results ... ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | int | | +| _c1 | string | | ++------+--------+---------+ +Returned 2 row(s) in 0.10s +[localhost:21000] > describe formatted v1; +Query: describe formatted v1 +Query finished, fetching results ... 
++------------------------------+------------------------------+----------------------+ +| name | type | comment | ++------------------------------+------------------------------+----------------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | None | +| _c1 | string | None | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | describe_formatted | NULL | +| Owner: | cloudera | NULL | +| CreateTime: | Mon Jul 22 16:56:38 EDT 2013 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Table Type: | VIRTUAL_VIEW | NULL | +| Table Parameters: | NULL | NULL | +| | transient_lastDdlTime | 1374526598 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | null | NULL | +| InputFormat: | null | NULL | +| OutputFormat: | null | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | +| | NULL | NULL | +| # View Information | NULL | NULL | +| View Original Text: | SELECT x, upper(s) FROM t1 | NULL | +| View Expanded Text: | SELECT x, upper(s) FROM t1 | NULL | ++------------------------------+------------------------------+----------------------+ +Returned 28 row(s) in 0.03s +[localhost:21000] > create external table t2 (x int, y int, s string) stored as parquet location '/user/cloudera/sample_data'; +[localhost:21000] > describe formatted t2; +Query: describe formatted t2 +Query finished, fetching results ... ++------------------------------+----------------------------------------------------+------------+ +| name | type | comment | ++------------------------------+----------------------------------------------------+------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | None | +| y | int | None | +| s | string | None | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | describe_formatted | NULL | +| Owner: | cloudera | NULL | +| CreateTime: | Mon Jul 22 17:01:47 EDT 2013 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Location: | hdfs://127.0.0.1:8020/user/cloudera/sample_data | NULL | +| Table Type: | EXTERNAL_TABLE | NULL | +| Table Parameters: | NULL | NULL | +| | EXTERNAL | TRUE | +| | transient_lastDdlTime | 1374526907 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL | +| InputFormat: | com.cloudera.impala.hive.serde.ParquetInputFormat | NULL | +| OutputFormat: | com.cloudera.impala.hive.serde.ParquetOutputFormat | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | ++------------------------------+----------------------------------------------------+------------+ +Returned 27 row(s) in 0.17s</codeblock> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have read and execute + permissions for all directories that are part of the table. + (A table could span multiple different HDFS directories if it is partitioned. + The directories could be widely scattered because a partition can reside + in an arbitrary HDFS directory based on its <codeph>LOCATION</codeph> attribute.) 
+ </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_tables.xml#tables"/>, <xref href="impala_create_table.xml#create_table"/>, + <xref href="impala_show.xml#show_tables"/>, <xref href="impala_show.xml#show_create_table"/> + </p> + </conbody> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_development.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_development.xml b/docs/topics/impala_development.xml new file mode 100644 index 0000000..1b35046 --- /dev/null +++ b/docs/topics/impala_development.xml @@ -0,0 +1,231 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="intro_dev"> + + <title>Developing Impala Applications</title> + <titlealts audience="PDF"><navtitle>Developing Applications</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The core development language with Impala is SQL. You can also use Java or other languages to interact with + Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For + specialized kinds of analysis, you can supplement the SQL built-in functions by writing + <xref href="impala_udf.xml#udfs">user-defined functions (UDFs)</xref> in C++ or Java. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="intro_sql"> + + <title>Overview of the Impala SQL Dialect</title> + <prolog> + <metadata> + <data name="Category" value="SQL"/> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL). As + such, it is familiar to users who are already familiar with running SQL queries on the Hadoop + infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in + functions. Impala also includes additional built-in functions for common industry features, to simplify + porting SQL from non-Hadoop systems. + </p> + + <p> + For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect + might seem familiar: + </p> + + <ul> + <li> + <p> + The <xref href="impala_select.xml#select">SELECT statement</xref> includes familiar clauses such as <codeph>WHERE</codeph>, + <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, and <codeph>WITH</codeph>. + You will find familiar notions such as + <xref href="impala_joins.xml#joins">joins</xref>, <xref href="impala_functions.xml#builtins">built-in + functions</xref> for processing strings, numbers, and dates, + <xref href="impala_aggregate_functions.xml#aggregate_functions">aggregate functions</xref>, + <xref href="impala_subqueries.xml#subqueries">subqueries</xref>, and + <xref href="impala_operators.xml#comparison_operators">comparison operators</xref> + such as <codeph>IN()</codeph> and <codeph>BETWEEN</codeph>. + The <codeph>SELECT</codeph> statement is the place where SQL standards compliance is most important. + </p> + </li> + + <li> + <p> + From the data warehousing world, you will recognize the notion of + <xref href="impala_partitioning.xml#partitioning">partitioned tables</xref>. + One or more columns serve as partition keys, and the data is physically arranged so that + queries that refer to the partition key columns in the <codeph>WHERE</codeph> clause + can skip partitions that do not match the filter conditions. 
For example, if you have 10
+ years' worth of data and use a clause such as <codeph>WHERE year = 2015</codeph>,
+ <codeph>WHERE year > 2010</codeph>, or <codeph>WHERE year IN (2014, 2015)</codeph>,
+ Impala skips all the data for non-matching years, greatly reducing the amount of I/O
+ for the query.
+ </p>
+ </li>
+
+ <li rev="1.2">
+ <p>
+ In Impala 1.2 and higher, <xref href="impala_udf.xml#udfs">UDFs</xref> let you perform custom comparisons
+ and transformation logic during <codeph>SELECT</codeph> and <codeph>INSERT...SELECT</codeph> statements.
+ </p>
+ </li>
+ </ul>
+
+ <p>
+ For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect
+ might require some learning and practice for you to become proficient in the Hadoop environment:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ Impala SQL is focused on queries and includes relatively little DML. There is no <codeph>UPDATE</codeph>
+ or <codeph>DELETE</codeph> statement. Stale data is typically discarded (by <codeph>DROP TABLE</codeph>
+ or <codeph>ALTER TABLE ... DROP PARTITION</codeph> statements) or replaced (by <codeph>INSERT
+ OVERWRITE</codeph> statements).
+ </p>
+ </li>
+
+ <li>
+ <p>
+ All data creation is done by <codeph>INSERT</codeph> statements, which typically insert data in bulk by
+ querying from other tables. There are two variations: <codeph>INSERT INTO</codeph>, which appends to the
+ existing data, and <codeph>INSERT OVERWRITE</codeph>, which replaces the entire contents of a table or
+ partition (similar to <codeph>TRUNCATE TABLE</codeph> followed by a new <codeph>INSERT</codeph>).
+ Although there is an <codeph>INSERT ... VALUES</codeph> syntax to insert a small number of rows in
+ a single statement, it is far more efficient to use <codeph>INSERT ... SELECT</codeph> to copy
+ and transform large amounts of data from one table to another in a single operation.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ You often construct Impala table definitions and data files in some other environment, and then attach
+ Impala so that it can run real-time queries. The same data files and table metadata are shared with other
+ components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data
+ inserted by Hive, and Hive can access tables and data produced by Impala. Many other Hadoop components
+ can write files in formats such as Parquet and Avro, which can then be queried by Impala.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL
+ includes some idioms that you might find in the import utilities for traditional database systems. For
+ example, you can create a table that reads comma-separated or tab-separated text files, specifying the
+ separator in the <codeph>CREATE TABLE</codeph> statement. You can create <b>external tables</b> that read
+ existing data files but do not move or transform them.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does
+ not require length constraints on string data types. For example, you can define a database column as
+ <codeph>STRING</codeph> with unlimited length, rather than <codeph>CHAR(1)</codeph> or
+ <codeph>VARCHAR(64)</codeph>.
<ph rev="2.0.0">(Although in Impala 2.0 and later, you can also use
+ length-constrained <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types.)</ph>
+ </p>
+ </li>
+
+ </ul>
+
+ <p>
+ <b>Related information:</b> <xref href="impala_langref.xml#langref"/>, especially
+ <xref href="impala_langref_sql.xml#langref_sql"/> and <xref href="impala_functions.xml#builtins"/>
+ </p>
+ </conbody>
+ </concept>
+
+<!-- Bunch of potential concept topics for future consideration. Major areas of Impala modelled on areas of discussion for Oracle Database, and distributed databases in general. -->
+
+ <concept id="intro_datatypes" audience="Cloudera">
+
+ <title>Overview of Impala SQL Data Types</title>
+
+ <conbody/>
+ </concept>
+
+ <concept id="intro_network" audience="Cloudera">
+
+ <title>Overview of Impala Network Topology</title>
+
+ <conbody/>
+ </concept>
+
+ <concept id="intro_cluster" audience="Cloudera">
+
+ <title>Overview of Impala Cluster Topology</title>
+
+ <conbody/>
+ </concept>
+
+ <concept id="intro_apis">
+
+ <title>Overview of Impala Programming Interfaces</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="JDBC"/>
+ <data name="Category" value="ODBC"/>
+ <data name="Category" value="Hue"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ You can connect and submit requests to the Impala daemons through:
+ </p>
+
+ <ul>
+ <li>
+ The <codeph><xref href="impala_impala_shell.xml#impala_shell">impala-shell</xref></codeph> interactive
+ command interpreter.
+ </li>
+
+ <li>
+ The <xref href="http://gethue.com/" scope="external" format="html">Hue</xref> web-based user interface.
+ </li>
+
+ <li>
+ <xref href="impala_jdbc.xml#impala_jdbc">JDBC</xref>.
+ </li>
+
+ <li>
+ <xref href="impala_odbc.xml#impala_odbc">ODBC</xref>.
+ </li>
+ </ul>
+
+ <p>
+ With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications
+ running on non-Linux platforms. You can also use Impala in combination with various Business Intelligence
+ tools that use the JDBC and ODBC interfaces.
+ </p>
+
+ <p>
+ Each <codeph>impalad</codeph> daemon process, running on separate nodes in a cluster, listens to
+ <xref href="impala_ports.xml#ports">several ports</xref> for incoming requests. Requests from
+ <codeph>impala-shell</codeph> and Hue are routed to the <codeph>impalad</codeph> daemons through the same
+ port. The <codeph>impalad</codeph> daemons listen on separate ports for JDBC and ODBC requests.
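+ </p>
+
+ <p>
+ As a quick illustration (the hostname and table name here are hypothetical), you might connect
+ <cmdname>impala-shell</cmdname> to a particular coordinator daemon and issue queries:
+ </p>
+
+<codeblock>$ impala-shell -i impala-host.example.com:21000
+[impala-host.example.com:21000] > select version();
+[impala-host.example.com:21000] > select count(*) from sample_table;</codeblock>
+
+ <p>
+ JDBC and ODBC applications connect to a different port on the same daemons; 21050 is the default port for JDBC connections.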
+ </p> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_disable_cached_reads.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_disable_cached_reads.xml b/docs/topics/impala_disable_cached_reads.xml new file mode 100644 index 0000000..bce6091 --- /dev/null +++ b/docs/topics/impala_disable_cached_reads.xml @@ -0,0 +1,36 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="disable_cached_reads" rev="1.4.0"> + + <title>DISABLE_CACHED_READS Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="HDFS"/> + <data name="Category" value="HDFS Caching"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">DISABLE_CACHED_READS query option</indexterm> + Prevents Impala from reading data files that are <q>pinned</q> in memory + through the HDFS caching feature. Primarily a debugging option for + cases where processing of HDFS cached data is concentrated on a single + host, leading to excessive CPU usage on that host. + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + + <p conref="../shared/impala_common.xml#common/default_false"/> + + <p conref="../shared/impala_common.xml#common/added_in_140"/> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_disable_codegen.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_disable_codegen.xml b/docs/topics/impala_disable_codegen.xml new file mode 100644 index 0000000..ab306e8 --- /dev/null +++ b/docs/topics/impala_disable_codegen.xml @@ -0,0 +1,38 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="disable_codegen"> + + <title>DISABLE_CODEGEN Query Option</title> + <titlealts audience="PDF"><navtitle>DISABLE_CODEGEN</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Performance"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">DISABLE_CODEGEN query option</indexterm> + This is a debug option, intended for diagnosing and working around issues that cause crashes. If a query + fails with an <q>illegal instruction</q> or other hardware-specific message, try setting + <codeph>DISABLE_CODEGEN=true</codeph> and running the query again. If the query succeeds only when the + <codeph>DISABLE_CODEGEN</codeph> option is turned on, submit the problem to <keyword keyref="support_org"/> and include that + detail in the problem report. Do not otherwise run with this setting turned on, because it results in lower + overall performance. + </p> + + <p> + Because the code generation phase adds a small amount of overhead for each query, you might turn on the + <codeph>DISABLE_CODEGEN</codeph> option to achieve maximum throughput when running many short-lived queries + against small tables. 
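+ </p>
+
+ <p>
+ For example, the following hypothetical session (the table and column names are placeholders)
+ brackets a short-lived query with the option:
+ </p>
+
+<codeblock>set disable_codegen=true;
+select count(*) from small_lookup_table where code = 'A';
+set disable_codegen=false;</codeblock>
+
+ <p>
+ This technique only pays off when the code generation overhead is a significant fraction of the total query time, as with trivial queries against small tables.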
+ </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_disable_outermost_topn.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_disable_outermost_topn.xml b/docs/topics/impala_disable_outermost_topn.xml new file mode 100644 index 0000000..24fcd2a --- /dev/null +++ b/docs/topics/impala_disable_outermost_topn.xml @@ -0,0 +1,29 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="disable_outermost_topn" rev="2.5.0"> + + <title>DISABLE_OUTERMOST_TOPN Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.5.0"> + <indexterm audience="Cloudera">DISABLE_OUTERMOST_TOPN query option</indexterm> + </p> + + <p> + <b>Type:</b> + </p> + + <p> + <b>Default:</b> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_disable_row_runtime_filtering.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_disable_row_runtime_filtering.xml b/docs/topics/impala_disable_row_runtime_filtering.xml new file mode 100644 index 0000000..a8c07cb --- /dev/null +++ b/docs/topics/impala_disable_row_runtime_filtering.xml @@ -0,0 +1,65 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="disable_row_runtime_filtering" rev="2.5.0"> + + <title>DISABLE_ROW_RUNTIME_FILTERING Query Option (<keyword keyref="impala25"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>DISABLE_ROW_RUNTIME_FILTERING</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.5.0"> + <indexterm audience="Cloudera">DISABLE_ROW_RUNTIME_FILTERING query option</indexterm> + The <codeph>DISABLE_ROW_RUNTIME_FILTERING</codeph> query option + reduces the scope of the runtime filtering feature. Queries still dynamically prune + partitions, but do not apply the filtering logic to individual rows within partitions. + </p> + + <p> + Only applies to queries against Parquet tables. For other file formats, Impala + only prunes at the level of partitions, not individual rows. + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false"/> + + <p conref="../shared/impala_common.xml#common/added_in_250"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Impala automatically evaluates whether the per-row filters are being + effective at reducing the amount of intermediate data. Therefore, + this option is typically only needed for the rare case where Impala + cannot accurately determine how effective the per-row filtering is + for a query. 
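+ </p>
+
+ <p>
+ A minimal sketch of applying the option to a single query (the tables are hypothetical Parquet tables):
+ </p>
+
+<codeblock>set disable_row_runtime_filtering=true;
+-- Runtime filters still prune entire partitions;
+-- only the per-row filtering within the scans is skipped.
+select count(*) from parquet_facts f join dims d on f.dim_id = d.id;
+set disable_row_runtime_filtering=false;</codeblock>
+
+ <p>
+ Resetting the option afterward keeps the reduced filtering from affecting unrelated queries in the same session.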
+ </p> + + <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/> + + <p> + Because this setting only improves query performance in very specific + circumstances, depending on the query characteristics and data distribution, + only use it when you determine through benchmarking that it improves + performance of specific expensive queries. + Consider setting this query option immediately before the expensive query and + unsetting it immediately afterward. + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_runtime_filtering.xml"/>, + <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> + <!-- , <xref href="impala_partitioning.xml#dynamic_partition_pruning"/> --> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_disable_streaming_preaggregations.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_disable_streaming_preaggregations.xml b/docs/topics/impala_disable_streaming_preaggregations.xml new file mode 100644 index 0000000..a3919af --- /dev/null +++ b/docs/topics/impala_disable_streaming_preaggregations.xml @@ -0,0 +1,45 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="disable_streaming_preaggregations" rev="2.5.0 IMPALA-1305"> + + <title>DISABLE_STREAMING_PREAGGREGATIONS Query Option (<keyword keyref="impala25"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>DISABLE_STREAMING_PREAGGREGATIONS</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.5.0 IMPALA-1305"> + <indexterm audience="Cloudera">DISABLE_STREAMING_PREAGGREGATIONS query option</indexterm> + Turns off the <q>streaming preaggregation</q> optimization that is available in <keyword keyref="impala25_full"/> + and higher. This optimization reduces unnecessary work performed by queries that perform aggregation + operations on columns with few or no duplicate values, for example <codeph>DISTINCT <varname>id_column</varname></codeph> + or <codeph>GROUP BY <varname>unique_column</varname></codeph>. If the optimization causes regressions in + existing queries that use aggregation functions, you can turn it off as needed by setting this query option. + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <note conref="../shared/impala_common.xml#common/one_but_not_true"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + <p> + Typically, queries that would require enabling this option involve very large numbers of + aggregated values, such as a billion or more distinct keys being processed on each + worker node. 
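+ </p>
+
+ <p>
+ For example, you might benchmark such an aggregation query with and without the optimization
+ (the table and column names are hypothetical):
+ </p>
+
+<codeblock>set disable_streaming_preaggregations=true;
+select count(distinct session_id) from huge_events;
+set disable_streaming_preaggregations=false;</codeblock>
+
+ <p>
+ If the query is not measurably faster with the option turned on, leave the option at its default value.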
+ </p> + + <p conref="../shared/impala_common.xml#common/added_in_250"/> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_disable_unsafe_spills.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_disable_unsafe_spills.xml b/docs/topics/impala_disable_unsafe_spills.xml new file mode 100644 index 0000000..14b687c --- /dev/null +++ b/docs/topics/impala_disable_unsafe_spills.xml @@ -0,0 +1,53 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="disable_unsafe_spills"> + + <title>DISABLE_UNSAFE_SPILLS Query Option (<keyword keyref="impala20"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>DISABLE_UNSAFE_SPILLS</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Scalability"/> + <data name="Category" value="Memory"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.0.0"> + <indexterm audience="Cloudera">DISABLE_UNSAFE_SPILLS query option</indexterm> + Enable this option if you prefer to have queries fail when they exceed the Impala memory limit, rather than + write temporary data to disk. + </p> + + <p> + Queries that <q>spill</q> to disk typically complete successfully, when in earlier Impala releases they would have failed. + However, queries with exorbitant memory requirements due to missing statistics or inefficient join clauses could + become so slow as a result that you would rather have them cancelled automatically and reduce the memory + usage through standard Impala tuning techniques. + </p> + + <p> + This option prevents only <q>unsafe</q> spill operations, meaning that one or more tables are missing + statistics or the query does not include a hint to set the most efficient mechanism for a join or + <codeph>INSERT ... SELECT</codeph> into a partitioned table. These are the tables most likely to result in + suboptimal execution plans that could cause unnecessary spilling. Therefore, leaving this option enabled is a + good way to find tables on which to run the <codeph>COMPUTE STATS</codeph> statement. + </p> + + <p> + See <xref href="impala_scalability.xml#spill_to_disk"/> for information about the <q>spill to disk</q> + feature for queries processing large result sets with joins, <codeph>ORDER BY</codeph>, <codeph>GROUP + BY</codeph>, <codeph>DISTINCT</codeph>, aggregation functions, or analytic functions. 
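+ </p>
+
+ <p>
+ The following sketch (with placeholder table names) shows the intended workflow: let an
+ <q>unsafe</q> query fail fast, then compute statistics so that later runs are planned accurately:
+ </p>
+
+<codeblock>set disable_unsafe_spills=true;
+-- If BIG_T or OTHER_T is missing statistics, a memory-intensive join
+-- is cancelled when it exceeds the memory limit, rather than spilling.
+select * from big_t join other_t on big_t.id = other_t.id;
+compute stats big_t;
+compute stats other_t;</codeblock>
+
+ <p>
+ Once statistics are in place, the same query is no longer considered unsafe, so it is allowed to spill to disk if it genuinely needs to.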
+ </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_disk_space.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_disk_space.xml b/docs/topics/impala_disk_space.xml new file mode 100644 index 0000000..b6daaeb --- /dev/null +++ b/docs/topics/impala_disk_space.xml @@ -0,0 +1,129 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="disk_space"> + + <title>Managing Disk Space for Impala Data</title> + <titlealts audience="PDF"><navtitle>Managing Disk Space</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Disk Storage"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Compression"/> + </metadata> + </prolog> + + <conbody> + + <p> + Although Impala typically works with many large files in an HDFS storage system with plenty of capacity, + there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques + to minimize space consumption and file duplication. + </p> + + <ul> + <li> + <p> + Use compact binary file formats where practical. Numeric and time-based data in particular can be stored + in more compact form in binary data files. Depending on the file format, various compression and encoding + features can reduce file size even further. You can specify the <codeph>STORED AS</codeph> clause as part + of the <codeph>CREATE TABLE</codeph> statement, or <codeph>ALTER TABLE</codeph> with the <codeph>SET + FILEFORMAT</codeph> clause for an existing table or partition within a partitioned table. See + <xref href="impala_file_formats.xml#file_formats"/> for details about file formats, especially + <xref href="impala_parquet.xml#parquet"/>. See <xref href="impala_create_table.xml#create_table"/> and + <xref href="impala_alter_table.xml#alter_table"/> for syntax details. + </p> + </li> + + <li> + <p> + You manage underlying data files differently depending on whether the corresponding Impala table is + defined as an <xref href="impala_tables.xml#internal_tables">internal</xref> or + <xref href="impala_tables.xml#external_tables">external</xref> table: + </p> + <ul> + <li> + Use the <codeph>DESCRIBE FORMATTED</codeph> statement to check if a particular table is internal + (managed by Impala) or external, and to see the physical location of the data files in HDFS. See + <xref href="impala_describe.xml#describe"/> for details. + </li> + + <li> + For Impala-managed (<q>internal</q>) tables, use <codeph>DROP TABLE</codeph> statements to remove + data files. See <xref href="impala_drop_table.xml#drop_table"/> for details. + </li> + + <li> + For tables not managed by Impala (<q>external</q> tables), use appropriate HDFS-related commands such + as <codeph>hadoop fs</codeph>, <codeph>hdfs dfs</codeph>, or <codeph>distcp</codeph>, to create, move, + copy, or delete files within HDFS directories that are accessible by the <codeph>impala</codeph> user. 
+ Issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement after adding or removing any + files from the data directory of an external table. See <xref href="impala_refresh.xml#refresh"/> for + details. + </li> + + <li> + Use external tables to reference HDFS data files in their original location. With this technique, you + avoid copying the files, and you can map more than one Impala table to the same set of data files. When + you drop the Impala table, the data files are left undisturbed. See + <xref href="impala_tables.xml#external_tables"/> for details. + </li> + + <li> + Use the <codeph>LOAD DATA</codeph> statement to move HDFS files into the data directory for an Impala + table from inside Impala, without the need to specify the HDFS path of the destination directory. This + technique works for both internal and external tables. See + <xref href="impala_load_data.xml#load_data"/> for details. + </li> + </ul> + </li> + + <li> + <p> + Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space + might not be reclaimed for use by other files until sometime later, when the trashcan is emptied. See + <xref href="impala_drop_table.xml#drop_table"/> and the FAQ entry + <xref href="impala_faq.xml#faq_sql/faq_drop_table_space"/> for details. See + <xref href="impala_prereqs.xml#prereqs_account"/> for permissions needed for the HDFS trashcan to operate + correctly. + </p> + </li> + + <li> + <p> + Drop all tables in a database before dropping the database itself. See + <xref href="impala_drop_database.xml#drop_database"/> for details. + </p> + </li> + + <li> + <p> + Clean up temporary files after failed <codeph>INSERT</codeph> statements. If an <codeph>INSERT</codeph> + statement encounters an error, and you see a directory named <filepath>.impala_insert_staging</filepath> + or <filepath>_impala_insert_staging</filepath> left behind in the data directory for the table, it might + contain temporary data files taking up space in HDFS. You might be able to salvage these data files, for + example if they are complete but could not be moved into place due to a permission error. Or, you might + delete those files through commands such as <codeph>hadoop fs</codeph> or <codeph>hdfs dfs</codeph>, to + reclaim space before re-trying the <codeph>INSERT</codeph>. Issue <codeph>DESCRIBE FORMATTED + <varname>table_name</varname></codeph> to see the HDFS path where you can check for temporary files. + </p> + </li> + + <li rev="1.4.0"> + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/> + </li> + + <li rev="2.2.0"> + <p> + If you use the Amazon Simple Storage Service (S3) as a place to offload + data to reduce the volume of local storage, Impala 2.2.0 and higher + can query the data directly from S3. + See <xref href="impala_s3.xml#s3"/> for details. 
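+ </p>
+
+ <p>
+ For example (the bucket name and path are hypothetical), an external table can point Impala
+ directly at data that was offloaded to S3:
+ </p>
+
+<codeblock>create external table logs_s3 (event_time string, message string)
+ row format delimited fields terminated by ','
+ location 's3a://example-bucket/logs/';</codeblock>
+
+ <p>
+ Because the table is external, dropping it later removes only the metadata and leaves the S3 objects in place.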
+ </p> + </li> + </ul> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_distinct.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_distinct.xml b/docs/topics/impala_distinct.xml new file mode 100644 index 0000000..8661831 --- /dev/null +++ b/docs/topics/impala_distinct.xml @@ -0,0 +1,61 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="distinct"> + + <title>DISTINCT Operator</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">DISTINCT operator</indexterm> + The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to + remove duplicates: + </p> + +<codeblock>-- Returns the unique values from one column. +-- NULL is included in the set of values if any rows have a NULL in this column. +select distinct c_birth_country from customer; +-- Returns the unique combinations of values from multiple columns. +select distinct c_salutation, c_last_name from customer;</codeblock> + + <p> + You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically + <codeph>COUNT()</codeph>, to find how many different values a column contains: + </p> + +<codeblock>-- Counts the unique values from one column. +-- NULL is not included as a distinct value in the count. +select count(distinct c_birth_country) from customer; +-- Counts the unique combinations of values from multiple columns. +select count(distinct c_salutation, c_last_name) from customer;</codeblock> + + <p> + One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one + aggregation function in the same query. For example, you could not have a single query with both + <codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the + <codeph>SELECT</codeph> list. + </p> + + <p conref="../shared/impala_common.xml#common/zero_length_strings"/> + + <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/> + + <note> + <p> + In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order, + Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER + BY</codeph> clause if you need the values in alphabetical or numeric sorted order. 
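+ </p>
+
+ <p>
+ For example, reusing the <codeph>CUSTOMER</codeph> table from the earlier examples, spell out the
+ ordering explicitly when it matters:
+ </p>
+
+<codeblock>select distinct c_birth_country from customer
+ order by c_birth_country;</codeblock>
+
+ <p>
+ Without the <codeph>ORDER BY</codeph> clause, the same query could return the values in a different order on each run.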
+
+ </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_dml.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_dml.xml b/docs/topics/impala_dml.xml
new file mode 100644
index 0000000..b6c6399
--- /dev/null
+++ b/docs/topics/impala_dml.xml
@@ -0,0 +1,91 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="dml">
+
+ <title>DML Statements</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="DML"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Tables"/>
+ <data name="Category" value="ETL"/>
+ <data name="Category" value="Ingest"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ DML refers to <q>Data Manipulation Language</q>, the subset of SQL statements that modify the data stored in
+ tables. Because Impala focuses on query performance and leverages the append-only nature of HDFS storage,
+ it currently supports only a small set of DML statements:
+ </p>
+
+ <ul>
+ <li>
+ <xref keyref="delete"/>. Works for Kudu tables only.
+ </li>
+
+ <li>
+ <xref keyref="insert"/>.
+ </li>
+
+ <li>
+ <xref keyref="load_data"/>. Does not apply to HBase or Kudu tables.
+ </li>
+
+ <li>
+ <xref keyref="update"/>. Works for Kudu tables only.
+ </li>
+
+ <li>
+ <xref keyref="upsert"/>. Works for Kudu tables only.
+ </li>
+ </ul>
+
+ <p>
+ <codeph>INSERT</codeph> in Impala is primarily optimized for inserting large volumes of data in a single
+ statement, to make effective use of the multi-megabyte HDFS blocks. This is the way in Impala to create new
+ data files. If you intend to insert one or a few rows at a time, such as with the <codeph>INSERT ...
+ VALUES</codeph> syntax, that technique is much more efficient for Impala tables stored in HBase than for
+ tables backed by HDFS data files. See
+ <xref href="impala_hbase.xml#impala_hbase"/> for details.
+ </p>
+
+ <p>
+ <codeph>LOAD DATA</codeph> moves existing data files into the directory for an Impala table, making them
+ immediately available for Impala queries. This is one way in Impala to work with data files produced by other
+ Hadoop components. (<codeph>CREATE EXTERNAL TABLE</codeph> is the other alternative; with external tables,
+ you can query existing data files while the files remain in their original location.)
+ </p>
+
+ <p>
+ In <keyword keyref="impala28_full"/> and higher, Impala does support the <codeph>UPDATE</codeph>, <codeph>DELETE</codeph>,
+ and <codeph>UPSERT</codeph> statements for Kudu tables.
+ For HDFS or S3 tables, to simulate the effects of an <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> statement
+ in other database systems, typically you use <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> to copy data
+ from one table to another, filtering out or changing the appropriate rows during the copy operation,
+ as in the sketch below.
+ </p>
+
+ <p>
+ You can also achieve a result similar to <codeph>UPDATE</codeph> by using Impala tables stored in HBase.
+ When you insert a row into an HBase table, and the table
+ already contains a row with the same value for the key column, the older row is hidden, effectively the same
+ as a single-row <codeph>UPDATE</codeph>.
+ </p>
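+
+ <p>
+ The following is a minimal sketch of simulating a <codeph>DELETE</codeph> and an <codeph>UPDATE</codeph>
+ on HDFS-backed tables by copying data. All table and column names here are hypothetical, and the
+ destination table of the <codeph>INSERT</codeph> is assumed to already exist:
+ </p>
+
+<codeblock>-- "Delete" the rows for discontinued products by copying every other row.
+create table products_current as
+ select * from products where status != 'DISCONTINUED';
+
+-- "Update" the price of one product during the copy, leaving other rows unchanged.
+insert overwrite table products_repriced
+ select id, name, case when id = 1234 then 9.99 else price end as price
+ from products;
+</codeblock>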
+
+ <p rev="2.6.0">
+ Impala can perform DML operations for tables or partitions stored in the Amazon S3 filesystem
+ with <keyword keyref="impala26_full"/> and higher. See <xref href="impala_s3.xml#s3"/> for details.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ The other major classifications of SQL statements are data definition language (see
+ <xref href="impala_ddl.xml#ddl"/>) and queries (see <xref href="impala_select.xml#select"/>).
+ </p>
+ </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_double.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_double.xml b/docs/topics/impala_double.xml
new file mode 100644
index 0000000..c69eae2
--- /dev/null
+++ b/docs/topics/impala_double.xml
@@ -0,0 +1,100 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="double">
+
+ <title>DOUBLE Data Type</title>
+ <titlealts audience="PDF"><navtitle>DOUBLE</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Data Types"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Schemas"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ A double precision floating-point data type, used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER
+ TABLE</codeph> statements.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+ <p>
+ In the column definition of a <codeph>CREATE TABLE</codeph> statement:
+ </p>
+
+<codeblock><varname>column_name</varname> DOUBLE</codeblock>
+
+ <p>
+ <b>Range:</b> 4.94065645841246544e-324d .. 1.79769313486231570e+308, positive or negative
+ </p>
+
+ <p>
+ <b>Precision:</b> 15 to 17 significant digits, depending on usage. The number of significant digits does
+ not depend on the position of the decimal point.
+ </p>
+
+ <p>
+ <b>Representation:</b> The values are stored in 8 bytes, using
+ <xref href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format" scope="external" format="html">IEEE 754 Double Precision Binary Floating Point</xref> format.
+ </p>
+
+ <p>
+ <b>Conversions:</b> Impala does not automatically convert <codeph>DOUBLE</codeph> to any other type. You can
+ use <codeph>CAST()</codeph> to convert <codeph>DOUBLE</codeph> values to <codeph>FLOAT</codeph>,
+ <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, <codeph>INT</codeph>, <codeph>BIGINT</codeph>,
+ <codeph>STRING</codeph>, <codeph>TIMESTAMP</codeph>, or <codeph>BOOLEAN</codeph>. You can use exponential
+ notation in <codeph>DOUBLE</codeph> literals or when casting from <codeph>STRING</codeph>, for example
+ <codeph>1.0e6</codeph> to represent one million.
+ <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ The data type <codeph>REAL</codeph> is an alias for <codeph>DOUBLE</codeph>.
+ </p>
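+
+ <p>
+ For instance, a brief sketch of the conversion and literal rules described above; the table name is
+ hypothetical:
+ </p>
+
+<codeblock>-- REAL is accepted as a column type and is treated as DOUBLE.
+create table measurements (reading real);
+
+-- Exponential notation in a DOUBLE literal, and explicit casts to other types.
+select cast(1.0e6 as bigint);
+select cast('2.5e-3' as double);
+</codeblock>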
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>CREATE TABLE t1 (x DOUBLE);
+SELECT CAST(1000.5 AS DOUBLE);
+</codeblock>
+
+ <p conref="../shared/impala_common.xml#common/partitioning_imprecise"/>
+
+ <p conref="../shared/impala_common.xml#common/hbase_ok"/>
+
+ <p conref="../shared/impala_common.xml#common/parquet_ok"/>
+
+ <p conref="../shared/impala_common.xml#common/text_bulky"/>
+
+<!-- <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> -->
+
+ <p conref="../shared/impala_common.xml#common/internals_8_bytes"/>
+
+<!-- <p conref="../shared/impala_common.xml#common/added_in_20"/> -->
+
+ <p conref="../shared/impala_common.xml#common/column_stats_constant"/>
+
+ <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+<!-- This conref appears under SUM(), AVG(), FLOAT, and DOUBLE topics. -->
+
+ <p conref="../shared/impala_common.xml#common/sum_double"/>
+
+ <p conref="../shared/impala_common.xml#common/float_double_decimal_caveat"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ <xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_math_functions.xml#math_functions"/>,
+ <xref href="impala_float.xml#float"/>
+ </p>
+ </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_drop_data_source.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_drop_data_source.xml b/docs/topics/impala_drop_data_source.xml
new file mode 100644
index 0000000..be6167c
--- /dev/null
+++ b/docs/topics/impala_drop_data_source.xml
@@ -0,0 +1,35 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept audience="Cloudera" rev="1.4.0" id="drop_data_source">

+ <title>DROP DATA SOURCE Statement</title>
+ <titlealts audience="PDF"><navtitle>DROP DATA SOURCE</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="DDL"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">DROP DATA SOURCE statement</indexterm>
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/ddl_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+ </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_drop_database.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_drop_database.xml b/docs/topics/impala_drop_database.xml
new file mode 100644
index 0000000..fb36dca
--- /dev/null
+++ b/docs/topics/impala_drop_database.xml
@@ -0,0 +1,130 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="drop_database">
+
+ <title>DROP DATABASE Statement</title>
+ <titlealts audience="PDF"><navtitle>DROP DATABASE</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Databases"/>
value="Databases"/> + <data name="Category" value="DDL"/> + <data name="Category" value="Schemas"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">DROP DATABASE statement</indexterm> + Removes a database from the system. The physical operations involve removing the metadata for the database + from the metastore, and deleting the corresponding <codeph>*.db</codeph> directory from HDFS. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>DROP (DATABASE|SCHEMA) [IF EXISTS] <varname>database_name</varname> <ph rev="2.3.0">[RESTRICT | CASCADE]</ph>;</codeblock> + + <p conref="../shared/impala_common.xml#common/ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + By default, the database must be empty before it can be dropped, to avoid losing any data. + </p> + + <p rev="2.3.0"> + In <keyword keyref="impala23_full"/> and higher, you can include the <codeph>CASCADE</codeph> + clause to make Impala drop all tables and other objects in the database before dropping the database itself. + The <codeph>RESTRICT</codeph> clause enforces the original requirement that the database be empty + before being dropped. Because the <codeph>RESTRICT</codeph> behavior is still the default, this + clause is optional. + </p> + + <p rev="2.3.0"> + The automatic dropping resulting from the <codeph>CASCADE</codeph> clause follows the same rules as the + corresponding <codeph>DROP TABLE</codeph>, <codeph>DROP VIEW</codeph>, and <codeph>DROP FUNCTION</codeph> statements. + In particular, the HDFS directories and data files for any external tables are left behind when the + tables are removed. + </p> + + <p> + When you do not use the <codeph>CASCADE</codeph> clause, drop or move all the objects inside the database manually + before dropping the database itself: + </p> + + <ul> + <li> + <p> + Use the <codeph>SHOW TABLES</codeph> statement to locate all tables and views in the database, + and issue <codeph>DROP TABLE</codeph> and <codeph>DROP VIEW</codeph> statements to remove them all. + </p> + </li> + <li> + <p> + Use the <codeph>SHOW FUNCTIONS</codeph> and <codeph>SHOW AGGREGATE FUNCTIONS</codeph> statements + to locate all user-defined functions in the database, and issue <codeph>DROP FUNCTION</codeph> + and <codeph>DROP AGGREGATE FUNCTION</codeph> statements to remove them all. + </p> + </li> + <li> + <p> + To keep tables or views contained by a database while removing the database itself, use + <codeph>ALTER TABLE</codeph> and <codeph>ALTER VIEW</codeph> to move the relevant + objects to a different database before dropping the original database. + </p> + </li> + </ul> + + <p> + You cannot drop the current database, that is, the database your session connected to + either through the <codeph>USE</codeph> statement or the <codeph>-d</codeph> option of <cmdname>impala-shell</cmdname>. + Issue a <codeph>USE</codeph> statement to switch to a different database first. + Because the <codeph>default</codeph> database is always available, issuing + <codeph>USE default</codeph> is a convenient way to leave the current database + before dropping it. + </p> + + <p conref="../shared/impala_common.xml#common/hive_blurb"/> + + <p> + When you drop a database in Impala, the database can no longer be used by Hive. 
+
+ <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/s3_ddl"/>
+
+ <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+ <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+ <p rev="CDH-19187">
+ The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+ typically the <codeph>impala</codeph> user, must have write
+ permission for the directory associated with the database.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<!-- Better to conref the same examples in both places. -->
+
+ <p>
+ See <xref href="impala_create_database.xml#create_database"/> for examples covering <codeph>CREATE
+ DATABASE</codeph>, <codeph>USE</codeph>, and <codeph>DROP DATABASE</codeph>.
+ </p>
+
+ <codeblock conref="../shared/impala_common.xml#common/create_drop_db_example"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ <xref href="impala_databases.xml#databases"/>, <xref href="impala_create_database.xml#create_database"/>,
+ <xref href="impala_use.xml#use"/>, <xref href="impala_show.xml#show_databases"/>, <xref href="impala_drop_table.xml#drop_table"/>
+ </p>
+ </conbody>
+</concept>