http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_create_function.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_create_function.xml b/docs/topics/impala_create_function.xml
new file mode 100644
index 0000000..4140289
--- /dev/null
+++ b/docs/topics/impala_create_function.xml
@@ -0,0 +1,291 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.2" id="create_function">
+
+  <title>CREATE FUNCTION Statement</title>
+  <titlealts><navtitle>CREATE FUNCTION</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="DDL"/>
+      <data name="Category" value="Schemas"/>
+      <data name="Category" value="Impala Functions"/>
+      <data name="Category" value="UDFs"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">CREATE FUNCTION statement</indexterm>
+      Creates a user-defined function (UDF), which you can use to implement custom logic during
+      <codeph>SELECT</codeph> or <codeph>INSERT</codeph> operations.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+    <p>
+      The syntax is different depending on whether you create a scalar UDF, which is called once for each row and
+      implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by
+      multiple functions that compute intermediate results across sets of rows.
+    </p>
+
+    <p>
+      To create a scalar UDF, issue a <codeph>CREATE FUNCTION</codeph> statement:
+    </p>
+
+<codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]])
+  RETURNS <varname>return_type</varname>
+  LOCATION '<varname>hdfs_path</varname>'
+  SYMBOL='<varname>symbol_or_class</varname>'</codeblock>
+
+    <p>
+      To create a UDA, issue a <codeph>CREATE AGGREGATE FUNCTION</codeph> statement:
+    </p>
+
+<codeblock>CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]])
+  RETURNS <varname>return_type</varname>
+  LOCATION '<varname>hdfs_path</varname>'
+  [INIT_FN='<varname>function</varname>']
+  UPDATE_FN='<varname>function</varname>'
+  MERGE_FN='<varname>function</varname>'
+  [PREPARE_FN='<varname>function</varname>']
+  [CLOSE_FN='<varname>function</varname>']
+  <ph rev="2.0.0">[SERIALIZE_FN='<varname>function</varname>']</ph>
+  [FINALIZE_FN='<varname>function</varname>']
+<!-- [INTERMEDIATE <varname>type_spec</varname>] --></codeblock>
+
+    <p conref="../shared/impala_common.xml#common/ddl_blurb"/>
+
+    <p>
+      <b>Varargs notation:</b>
+    </p>
+
+    <p>
+      If the underlying implementation of your function accepts a variable number of arguments:
+    </p>
+
+    <ul>
+      <li>
+        The variable arguments must go last in the argument list.
+      </li>
+
+      <li>
+        The variable arguments must all be of the same type.
+      </li>
+
+      <li>
+        You must include at least one instance of the variable arguments in every function call invoked from SQL.
+      </li>
+
+      <li>
+        You designate the variable portion of the argument list in the <codeph>CREATE FUNCTION</codeph> statement
+        by including <codeph>...</codeph> immediately after the type name of the first variable argument.
+        For example, to create a function that accepts an <codeph>INT</codeph> argument, followed by a
+        <codeph>BOOLEAN</codeph>, followed by one or more <codeph>STRING</codeph> arguments, your <codeph>CREATE
+        FUNCTION</codeph> statement would look like:
+<codeblock>CREATE FUNCTION <varname>func_name</varname> (INT, BOOLEAN, STRING ...)
+  RETURNS <varname>type</varname> LOCATION '<varname>path</varname>' SYMBOL='<varname>entry_point</varname>';
+</codeblock>
+      </li>
+    </ul>
+
+    <p>
+      See <xref href="impala_udf.xml#udf_varargs"/> for how to code the C++ or Java function to accept
+      variable-length argument lists.
+    </p>
+
+    <p>
+      <b>Scalar and aggregate functions:</b>
+    </p>
+
+    <p>
+      The simplest kind of user-defined function returns a single scalar value each time it is called, typically
+      once for each row in the result set. This general kind of function is what is usually meant by the term UDF.
+      User-defined aggregate functions (UDAs) are a specialized kind of UDF that produce a single value based on
+      the contents of multiple rows. You usually use UDAs in combination with a <codeph>GROUP BY</codeph> clause to
+      condense a large result set into a smaller one, or even a single row summarizing column values across an
+      entire table.
+    </p>
+
+    <p>
+      You create UDAs by using the <codeph>CREATE AGGREGATE FUNCTION</codeph> syntax. The clauses
+      <codeph>INIT_FN</codeph>, <codeph>UPDATE_FN</codeph>, <codeph>MERGE_FN</codeph>,
+      <ph rev="2.0.0"><codeph>SERIALIZE_FN</codeph>,</ph> <codeph>FINALIZE_FN</codeph>, and
+      <codeph>INTERMEDIATE</codeph> apply only when you create a UDA rather than a scalar UDF.
+    </p>
+
+    <p>
+      The <codeph>*_FN</codeph> clauses specify functions to call at different phases of function processing.
+    </p>
+
+    <ul>
+      <li>
+        <b>Initialize:</b> The function you specify with the <codeph>INIT_FN</codeph> clause does any initial
+        setup, such as initializing member variables in internal data structures. This function is often a stub for
+        simple UDAs. You can omit this clause and a default (no-op) function will be used.
+      </li>
+
+      <li>
+        <b>Update:</b> The function you specify with the <codeph>UPDATE_FN</codeph> clause is called once for each
+        row in the original result set, that is, before any <codeph>GROUP BY</codeph> clause is applied. A separate
+        instance of the function is called for each different value returned by the <codeph>GROUP BY</codeph>
+        clause. The final argument passed to this function is a pointer, to which you write an updated value based
+        on its original value and the value of the first argument.
+      </li>
+
+      <li>
+        <b>Merge:</b> The function you specify with the <codeph>MERGE_FN</codeph> clause is called an arbitrary
+        number of times, to combine intermediate values produced by different nodes or different threads as Impala
+        reads and processes data files in parallel. The final argument passed to this function is a pointer, to
+        which you write an updated value based on its original value and the value of the first argument.
+      </li>
+
+      <li rev="2.0.0">
+        <b>Serialize:</b> The function you specify with the <codeph>SERIALIZE_FN</codeph> clause frees memory
+        allocated to intermediate results. It is required if any memory was allocated by the Allocate function in
+        the Init, Update, or Merge functions, or if the intermediate type contains any pointers. See
+        <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the
+        UDA code samples</xref> for details.
+      </li>
+
+      <li>
+        <b>Finalize:</b> The function you specify with the <codeph>FINALIZE_FN</codeph> clause does any required
+        teardown for resources acquired by your UDF, such as freeing memory, closing file handles if you explicitly
+        opened any files, and so on. This function is often a stub for simple UDAs. You can omit this clause and a
+        default (no-op) function will be used. It is required in UDAs where the final return type is different from
+        the intermediate type, or if any memory was allocated by the Allocate function in the Init, Update, or
+        Merge functions. See
+        <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the
+        UDA code samples</xref> for details.
+      </li>
+    </ul>
+
+    <p>
+      If you use a consistent naming convention for each of the underlying functions, Impala can automatically
+      determine the names based on the first such clause, so the others are optional.
+    </p>
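+
+    <p>
+      For example, the following sketch (with a hypothetical function name and library path) relies on that
+      naming convention, specifying only the update function for a UDA:
+    </p>
+
+<codeblock>CREATE AGGREGATE FUNCTION my_avg(DOUBLE)
+  RETURNS DOUBLE
+  LOCATION '/user/impala/udfs/libudasample.so'
+  UPDATE_FN='MyAvgUpdate';
+</codeblock>
+
+    <p>
+      Under that convention, Impala would derive the names of the related functions, such as
+      <codeph>MyAvgInit</codeph>, <codeph>MyAvgMerge</codeph>, and <codeph>MyAvgFinalize</codeph>, from the
+      <codeph>UPDATE_FN</codeph> value and look for them in the same shared library.
+    </p>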
+
+    <p audience="Cloudera">
+      The <codeph>INTERMEDIATE</codeph> clause specifies the data type of intermediate values passed from the
+      <q>update</q> phase to the <q>merge</q> phase, and from the <q>merge</q> phase to the <q>finalize</q> phase.
+      You can use any of the existing Impala data types, or the special notation
+      <codeph>CHAR(<varname>n</varname>)</codeph> to allocate a scratch area of <varname>n</varname> bytes for the
+      intermediate result. For example, if the different phases of your UDA pass strings to each other but in the
+      end the function returns a <codeph>BIGINT</codeph> value, you would specify <codeph>INTERMEDIATE
+      STRING</codeph>. Likewise, if the different phases of your UDA pass 2 separate <codeph>BIGINT</codeph> values
+      between them (8 bytes each), you would specify <codeph>INTERMEDIATE CHAR(16)</codeph> so that each function
+      could read from and write to a 16-byte buffer.
+    </p>
+
+    <p>
+      For end-to-end examples of UDAs, see <xref href="impala_udf.xml#udfs"/>.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <ul>
+      <li>
+        You can write Impala UDFs in either C++ or Java. C++ UDFs are new to Impala, and are the recommended format
+        for high performance utilizing native code. Java-based UDFs are compatible between Impala and Hive, and are
+        most suited to reusing existing Hive UDFs. (Impala can run Java-based Hive UDFs but not Hive UDAs.)
+      </li>
+
+      <li>
+        The body of the UDF is represented by a <codeph>.so</codeph> or <codeph>.jar</codeph> file, which you store
+        in HDFS and the <codeph>CREATE FUNCTION</codeph> statement distributes to each Impala node.
+      </li>
+
+      <li>
+        Impala calls the underlying code during SQL statement evaluation, as many times as needed to process all
+        the rows from the result set. All UDFs are assumed to be deterministic, that is, to always return the same
+        result when passed the same argument values. Impala might or might not skip some invocations of a UDF if
+        the result value is already known from a previous call. Therefore, do not rely on the UDF being called a
+        specific number of times, and do not return different result values based on some external factor such as
+        the current time, a random number function, or an external data source that could be updated while an
+        Impala query is in progress.
+      </li>
+
+      <li>
+        The names of the function arguments in the UDF are not significant, only their number, positions, and data
+        types.
+      </li>
+
+      <li>
+        You can overload the same function name by creating multiple versions of the function, each with a
+        different argument signature, as shown in the example following this list. For security reasons, you cannot
+        create a UDF with the same name as any built-in function.
+      </li>
+
+      <li>
+        In the UDF code, you represent the function return result as a <codeph>struct</codeph>. This
+        <codeph>struct</codeph> contains two fields. The first field is a <codeph>boolean</codeph> representing
+        whether the value is <codeph>NULL</codeph> or not. (When this field is <codeph>true</codeph>, the return
+        value is interpreted as <codeph>NULL</codeph>.) The second field is the same type as the specified function
+        return type, and holds the return value when the function returns something other than
+        <codeph>NULL</codeph>.
+      </li>
+
+      <li>
+        In the UDF code, you represent the function arguments as an initial pointer to a UDF context structure,
+        followed by references to zero or more <codeph>struct</codeph>s, corresponding to each of the arguments.
+        Each <codeph>struct</codeph> has the same two fields as with the return value, a <codeph>boolean</codeph>
+        field representing whether the argument is <codeph>NULL</codeph>, and a field of the appropriate type
+        holding any non-<codeph>NULL</codeph> argument value.
+      </li>
+
+      <li>
+        For sample code and build instructions for UDFs,
+        see <xref href="https://github.com/cloudera/impala/tree/master/be/src/udf_samples" scope="external" format="html">the sample UDFs in the Impala GitHub repo</xref>.
+      </li>
+
+      <li>
+        Because the file representing the body of the UDF is stored in HDFS, it is automatically available to all
+        the Impala nodes. You do not need to manually copy any UDF-related files between servers.
+      </li>
+
+      <li>
+        Because Impala currently does not have any <codeph>ALTER FUNCTION</codeph> statement, if you need to rename
+        a function, move it to a different database, or change its signature or other properties, issue a
+        <codeph>DROP FUNCTION</codeph> statement for the original function followed by a <codeph>CREATE
+        FUNCTION</codeph> with the desired properties.
+      </li>
+
+      <li>
+        Because each UDF is associated with a particular database, either issue a <codeph>USE</codeph> statement
+        before doing any <codeph>CREATE FUNCTION</codeph> statements, or specify the name of the function as
+        <codeph><varname>db_name</varname>.<varname>function_name</varname></codeph>.
+      </li>
+    </ul>
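+
+    <p>
+      For example, the following sketch (with hypothetical entry points in a hypothetical library) creates two
+      overloaded versions of the same function name, distinguished by their argument signatures:
+    </p>
+
+<codeblock>CREATE FUNCTION my_pad(STRING) RETURNS STRING
+  LOCATION '/user/impala/udfs/libudfsample.so' SYMBOL='PadDefault';
+CREATE FUNCTION my_pad(STRING, INT) RETURNS STRING
+  LOCATION '/user/impala/udfs/libudfsample.so' SYMBOL='PadToWidth';
+</codeblock>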
+
+    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
+
+    <p>
+      Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types
+      (not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but
+      not Impala UDFs written in C++.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_udf.xml#udfs"/> for more background information, usage instructions, and examples for
+      Impala UDFs; <xref href="impala_drop_function.xml#drop_function"/>
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_create_role.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_create_role.xml b/docs/topics/impala_create_role.xml
new file mode 100644
index 0000000..975ce15
--- /dev/null
+++ b/docs/topics/impala_create_role.xml
@@ -0,0 +1,66 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.4.0" id="create_role">
+
+  <title>CREATE ROLE Statement (CDH 5.2 or higher only)</title>
+  <titlealts><navtitle>CREATE ROLE (CDH 5.2 or higher only)</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="DDL"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Sentry"/>
+      <data name="Category" value="Roles"/>
+      <!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. -->
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">CREATE ROLE statement</indexterm>
+<!-- Copied from Sentry docs. Turn into conref. -->
+      The <codeph>CREATE ROLE</codeph> statement creates a role to which privileges can be granted. Privileges can
+      be granted to roles, which can then be assigned to users. A user who has been assigned a role can exercise
+      only the privileges of that role. Only users that have administrative privileges can create or drop
+      roles. By default, the <codeph>hive</codeph>, <codeph>impala</codeph>, and <codeph>hue</codeph> users have
+      administrative privileges in Sentry.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>CREATE ROLE <varname>role_name</varname>
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/privileges_blurb"/>
+
+    <p>
+      Only administrative users (those with <codeph>ALL</codeph> privileges on the server, defined in the Sentry
+      policy file) can use this statement.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
+
+    <p>
+      Impala makes use of any roles and privileges specified by the <codeph>GRANT</codeph> and
+      <codeph>REVOKE</codeph> statements in Hive, and Hive makes use of any roles and privileges specified by the
+      <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala. The Impala <codeph>GRANT</codeph>
+      and <codeph>REVOKE</codeph> statements for privileges do not require the <codeph>ROLE</codeph> keyword to be
+      repeated before each role name, unlike the equivalent Hive statements.
+    </p>
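+
+    <p>
+      For example, the following sketch (with hypothetical role, database, and group names) shows the typical
+      sequence of creating a role, granting it a privilege, and assigning it to a group:
+    </p>
+
+<codeblock>CREATE ROLE analyst_role;
+GRANT SELECT ON DATABASE analytics TO ROLE analyst_role;
+GRANT ROLE analyst_role TO GROUP analysts;
+</codeblock>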
+
+<!-- To do: nail down the new SHOW syntax, e.g. SHOW ROLES, SHOW CURRENT ROLES, SHOW GROUPS. -->
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/>,
+      <xref href="impala_revoke.xml#revoke"/>, <xref href="impala_drop_role.xml#drop_role"/>,
+      <xref href="impala_show.xml#show"/>
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_create_table.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_create_table.xml b/docs/topics/impala_create_table.xml
new file mode 100644
index 0000000..cdaee4a
--- /dev/null
+++ b/docs/topics/impala_create_table.xml
@@ -0,0 +1,650 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="create_table">
+
+  <title>CREATE TABLE Statement</title>
+  <titlealts><navtitle>CREATE TABLE</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="DDL"/>
+      <data name="Category" value="Impala Data Types"/>
+      <data name="Category" value="HDFS Caching"/>
+      <data name="Category" value="Tables"/>
+      <data name="Category" value="Schemas"/>
+      <data audience="impala_next" name="Category" value="Kudu"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">CREATE TABLE statement</indexterm>
+      Creates a new table and specifies its characteristics. When you create a table, you can optionally specify
+      aspects such as:
+    </p>
+
+    <ul>
+      <li>
+        Whether the table is internal or external.
+      </li>
+
+      <li>
+        The columns and associated data types.
+      </li>
+
+      <li>
+        The columns used for physically partitioning the data.
+      </li>
+
+      <li>
+        The file format for data files.
+      </li>
+
+      <li>
+        The HDFS directory where the data files are located.
+      </li>
+    </ul>
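+
+    <p>
+      For example, the following sketch (with hypothetical table and directory names) touches each of these
+      aspects in a single statement:
+    </p>
+
+<codeblock>CREATE EXTERNAL TABLE sales_data (id BIGINT, amount DOUBLE)
+  PARTITIONED BY (year SMALLINT)
+  STORED AS PARQUET
+  LOCATION '/user/etl/sales_data';
+</codeblock>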
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+    <p>
+      The general syntax for creating a table and specifying its columns is as follows:
+    </p>
+
+    <p>
+      <b>Explicit column definitions:</b>
+    </p>
+
+<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
+  [(<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)]
+  [COMMENT '<varname>table_comment</varname>']
+  [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)]
+  [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
+  [
+   [ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>]
+  ]
+  [LOCATION '<varname>hdfs_path</varname>']
+  [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
+<ph rev="1.4.0">  [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
+</codeblock>
+
+    <p>
+      <b>Column definitions inferred from data file:</b>
+    </p>
+
+<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
+  LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'
+  [COMMENT '<varname>table_comment</varname>']
+  [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)]
+  [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
+  [
+   [ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>]
+  ]
+  [LOCATION '<varname>hdfs_path</varname>']
+  [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
+<ph rev="1.4.0">  [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
+</codeblock>
+
+    <p>
+      <b>CREATE TABLE AS SELECT:</b>
+    </p>
+
+<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
+  [COMMENT '<varname>table_comment</varname>']
+  [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
+  [
+   [ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>]
+  ]
+  [LOCATION '<varname>hdfs_path</varname>']
+  [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)]
+<ph rev="1.4.0">  [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED]
+AS
+  <varname>select_statement</varname></codeblock>
+
+<codeblock>data_type:
+    <varname>primitive_type</varname>
+  | array_type
+  | map_type
+  | struct_type
+
+primitive_type:
+    TINYINT
+  | SMALLINT
+  | INT
+  | BIGINT
+  | BOOLEAN
+  | FLOAT
+  | DOUBLE
+  <ph rev="1.4.0">| DECIMAL</ph>
+  | STRING
+  <ph rev="2.0.0">| CHAR</ph>
+  <ph rev="2.0.0">| VARCHAR</ph>
+  | TIMESTAMP
+
+<ph rev="2.3.0">complex_type:
+    struct_type
+  | array_type
+  | map_type
+
+struct_type: STRUCT < <varname>name</varname> : <varname>primitive_or_complex_type</varname> [COMMENT '<varname>comment_string</varname>'], ... >
+
+array_type: ARRAY < <varname>primitive_or_complex_type</varname> >
+
+map_type: MAP < <varname>primitive_type</varname>, <varname>primitive_or_complex_type</varname> >
+</ph>
+row_format:
+  DELIMITED [FIELDS TERMINATED BY '<varname>char</varname>' [ESCAPED BY '<varname>char</varname>']]
+  [LINES TERMINATED BY '<varname>char</varname>']
+
+file_format:
+    PARQUET
+  | TEXTFILE
+  | AVRO
+  | SEQUENCEFILE
+  | RCFILE
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/ddl_blurb"/>
+
+    <!-- Should really have some info up front about all the data types and file formats.
+         Consider adding here, or at least making inline links to the relevant keywords
+         in the syntax spec above. -->
+
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+    <p rev="2.3.0">
+      The Impala complex types (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>)
+      are available in CDH 5.5 / Impala 2.3 and higher.
+      Because you can nest these types (for example, to make an array of maps or a struct
+      with an array field), these types are also sometimes referred to as nested types.
+      See <xref href="impala_complex_types.xml#complex_types"/> for usage details.
+    </p>
+
+    <!-- This is kind of an obscure and rare usage scenario. Consider moving all the complex type stuff further down
+         after some of the more common clauses. -->
+    <p rev="2.3.0">
+      Impala can create tables containing complex type columns, with any supported file format.
+      Because currently Impala can only query complex type columns in Parquet tables, creating
+      tables with complex type columns and other file formats such as text is of limited use.
+      For example, you might create a text table including some columns with complex types with Impala, and use Hive
+      as part of your ETL pipeline to ingest the nested type data and copy it to an identical Parquet table.
+      Or you might create a partitioned table containing complex type columns using one file format, and
+      use <codeph>ALTER TABLE</codeph> to change the file format of individual partitions to Parquet; Impala
+      can then query only the Parquet-format partitions in that table.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
+
+    <p>
+      <b>Internal and external tables (EXTERNAL and LOCATION clauses):</b>
+    </p>
+
+    <p>
+      By default, Impala creates an <q>internal</q> table, where Impala manages the underlying data files for the
+      table, and physically deletes the data files when you drop the table. If you specify the
+      <codeph>EXTERNAL</codeph> clause, Impala treats the table as an <q>external</q> table, where the data files
+      are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves
+      the data files in place when you drop the table. For details about internal and external tables, see
+      <xref href="impala_tables.xml#tables"/>.
+    </p>
+
+    <p>
+      Typically, for an external table you include a <codeph>LOCATION</codeph> clause to specify the path to the
+      HDFS directory where Impala reads and writes files for the table.
+      For example, if your data pipeline produces
+      Parquet files in the HDFS directory <filepath>/user/etl/destination</filepath>, you might create an external
+      table as follows:
+    </p>
+
+<codeblock>CREATE EXTERNAL TABLE external_parquet (c1 INT, c2 STRING, c3 TIMESTAMP)
+  STORED AS PARQUET LOCATION '/user/etl/destination';
+</codeblock>
+
+    <p>
+      Although the <codeph>EXTERNAL</codeph> and <codeph>LOCATION</codeph> clauses are often specified together,
+      <codeph>LOCATION</codeph> is optional for external tables, and you can also specify <codeph>LOCATION</codeph>
+      for internal tables. The difference is all about whether Impala <q>takes control</q> of the underlying data
+      files and moves them when you rename the table, or deletes them when you drop the table. For more about
+      internal and external tables and how they interact with the <codeph>LOCATION</codeph> attribute, see
+      <xref href="impala_tables.xml#tables"/>.
+    </p>
+
+    <p>
+      <b>Partitioned tables (PARTITIONED BY clause):</b>
+    </p>
+
+    <p>
+      The <codeph>PARTITIONED BY</codeph> clause divides the data files based on the values from one or more
+      specified columns. Impala queries can use the partition metadata to minimize the amount of data that is read
+      from disk or transmitted across the network, particularly during join queries. For details about
+      partitioning, see <xref href="impala_partitioning.xml#partitioning"/>.
+    </p>
+
+    <p rev="kudu" audience="impala_next">
+      <b>Partitioning for Kudu tables (DISTRIBUTE BY clause):</b>
+    </p>
+
+    <p rev="kudu" audience="impala_next">
+      For Kudu tables, you specify logical partitioning across one or more columns using the
+      <codeph>DISTRIBUTE BY</codeph> clause. In contrast to partitioning for HDFS-based tables,
+      multiple values for a partition key column can be located in the same partition.
+      The optional <codeph>HASH</codeph> clause lets you divide one or a set of partition key columns
+      into a specified number of buckets; you can use more than one <codeph>HASH</codeph>
+      clause, specifying a distinct set of partition key columns for each.
+      The optional <codeph>RANGE</codeph> clause further subdivides the partitions, based on
+      a set of literal values for the partition key columns.
+    </p>
+
+    <p>
+      <b>Specifying file format (STORED AS and ROW FORMAT clauses):</b>
+    </p>
+
+    <p>
+      The <codeph>STORED AS</codeph> clause identifies the format of the underlying data files. Currently, Impala
+      can query more types of file formats than it can create or insert into. Use Hive to perform any create or
+      data load operations that are not currently available in Impala. For example, Impala can create a
+      SequenceFile table but cannot insert data into it. There are also Impala-specific procedures for using
+      compression with each kind of file format. For details about working with data files of various formats, see
+      <xref href="impala_file_formats.xml#file_formats"/>.
+    </p>
+
+    <note>
+      In Impala 1.4.0 and higher, Impala can create Avro tables, which formerly required doing the <codeph>CREATE
+      TABLE</codeph> statement in Hive. See <xref href="impala_avro.xml#avro"/> for details and examples.
+    </note>
+
+    <p>
+      By default (when no <codeph>STORED AS</codeph> clause is specified), data files in Impala tables are created
+      as text files with Ctrl-A (hex 01) characters as the delimiter.
+<!-- Verify if ROW FORMAT is entirely ignored outside of text tables, or does it apply somehow to SequenceFile and/or RCFile too? -->
+      Specify the <codeph>ROW FORMAT DELIMITED</codeph> clause to produce or ingest data files that use a different
+      delimiter character such as tab or <codeph>|</codeph>, or a different line end character such as carriage
+      return or newline. When specifying delimiter and line end characters with the <codeph>FIELDS TERMINATED
+      BY</codeph> and <codeph>LINES TERMINATED BY</codeph> clauses, use <codeph>'\t'</codeph> for tab,
+      <codeph>'\n'</codeph> for newline or linefeed, <codeph>'\r'</codeph> for carriage return, and
+      <codeph>\0</codeph> for ASCII <codeph>nul</codeph> (hex 00). For more examples of text
+      tables, see <xref href="impala_txtfile.xml#txtfile"/>.
+    </p>
+
+    <p>
+      The <codeph>ESCAPED BY</codeph> clause applies both to text files that you create through an
+      <codeph>INSERT</codeph> statement to an Impala <codeph>TEXTFILE</codeph> table, and to existing data files
+      that you put into an Impala table directory. (You can ingest existing data files either by creating the table
+      with <codeph>CREATE EXTERNAL TABLE ... LOCATION</codeph>, the <codeph>LOAD DATA</codeph> statement, or
+      through an HDFS operation such as <codeph>hdfs dfs -put <varname>file</varname>
+      <varname>hdfs_path</varname></codeph>.) Choose an escape character that is not used anywhere else in the
+      file, and put it in front of each instance of the delimiter character that occurs within a field value.
+      Surrounding field values with quotation marks does not help Impala to parse fields with embedded delimiter
+      characters; the quotation marks are considered to be part of the column value. If you want to use
+      <codeph>\</codeph> as the escape character, specify the clause in <cmdname>impala-shell</cmdname> as
+      <codeph>ESCAPED BY '\\'</codeph>.
+    </p>
+
+    <note conref="../shared/impala_common.xml#common/thorn"/>
+
+    <p>
+      <b>Cloning tables (LIKE clause):</b>
+    </p>
+
+    <p>
+      To create an empty table with the same columns, comments, and other attributes as another table, use the
+      following variation. The <codeph>CREATE TABLE ... LIKE</codeph> form allows a restricted set of clauses,
+      currently only the <codeph>LOCATION</codeph>, <codeph>COMMENT</codeph>, and <codeph>STORED AS</codeph>
+      clauses.
+    </p>
+
+<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
+  <ph rev="1.4.0">LIKE { [<varname>db_name</varname>.]<varname>table_name</varname> | PARQUET '<varname>hdfs_path_of_parquet_file</varname>' }</ph>
+  [COMMENT '<varname>table_comment</varname>']
+  [STORED AS <varname>file_format</varname>]
+  [LOCATION '<varname>hdfs_path</varname>']</codeblock>
+
+    <note rev="1.2">
+      To clone the structure of a table and transfer data into it in a single operation, use the <codeph>CREATE
+      TABLE AS SELECT</codeph> syntax described in the next subsection.
+    </note>
+
+    <p>
+      When you clone the structure of an existing table using the <codeph>CREATE TABLE ... LIKE</codeph> syntax,
+      the new table keeps the same file format as the original one, so you only need to specify the <codeph>STORED
+      AS</codeph> clause if you want to use a different file format, or when specifying a view as the original
+      table. (Creating a table <q>like</q> a view produces a text table by default.)
+    </p>
+
+    <p>
+      Although normally Impala cannot create an HBase table directly, Impala can clone the structure of an existing
+      HBase table with the <codeph>CREATE TABLE ... LIKE</codeph> syntax, preserving the file format and metadata
+      from the original table.
+    </p>
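+
+    <p>
+      For example, the following sketch (with hypothetical table names) clones the structure of an existing
+      table, keeping its file format, then makes another clone that switches the new table to Parquet:
+    </p>
+
+<codeblock>CREATE TABLE t1_copy LIKE t1;
+CREATE TABLE t1_parquet LIKE t1 STORED AS PARQUET;
+</codeblock>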
+
+    <p>
+      There are some exceptions to the ability to use <codeph>CREATE TABLE ... LIKE</codeph> with an Avro table.
+      For example, you cannot use this technique for an Avro table that is specified with an Avro schema but no
+      columns. When in doubt, check if a <codeph>CREATE TABLE ... LIKE</codeph> operation works in Hive; if not, it
+      typically will not work in Impala either.
+    </p>
+
+    <p>
+      If the original table is partitioned, the new table inherits the same partition key columns. Because the new
+      table is initially empty, it does not inherit the actual partitions that exist in the original one. To create
+      partitions in the new table, insert data or issue <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/create_table_like_view"/>
+
+    <p>
+      Because <codeph>CREATE TABLE ... LIKE</codeph> only manipulates table metadata, not the physical data of the
+      table, issue <codeph>INSERT INTO TABLE</codeph> statements afterward to copy any data from the original table
+      into the new one, optionally converting the data to a new file format. (For some file formats, Impala can do
+      a <codeph>CREATE TABLE ... LIKE</codeph> to create the table, but Impala cannot insert data in that file
+      format; in these cases, you must load the data in Hive. See
+      <xref href="impala_file_formats.xml#file_formats"/> for details.)
+    </p>
+
+    <p rev="1.2" id="ctas">
+      <b>CREATE TABLE AS SELECT:</b>
+    </p>
+
+    <p>
+      The <codeph>CREATE TABLE AS SELECT</codeph> syntax is a shorthand notation to create a table based on column
+      definitions from another table, and copy data from the source table to the destination table without issuing
+      any separate <codeph>INSERT</codeph> statement. This idiom is so popular that it has its own acronym,
+      <q>CTAS</q>.
+<!--
+      The <codeph>CREATE TABLE AS SELECT</codeph> syntax is as follows:
+-->
+    </p>
+
+<!-- CREATE TABLE AS <select> now incorporated up higher in the original syntax diagram,
+     thus commented out here.
+     Does CTAS only accept a limited subset of clauses? -->
+
+<!--
+<codeblock rev="1.2">CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname>
+  [COMMENT '<varname>table_comment</varname>']
+  [STORED AS <varname>file_format</varname>]
+  [LOCATION '<varname>hdfs_path</varname>']
+AS
+  <varname>select_statement</varname></codeblock>
+-->
+
+    <p rev="1.2">
+      See <xref href="impala_select.xml#select"/> for details about query syntax for the <codeph>SELECT</codeph>
+      portion of a <codeph>CREATE TABLE AS SELECT</codeph> statement.
+    </p>
+
+    <p rev="1.2">
+      The newly created table inherits the column names that you select from the original table, which you can
+      override by specifying column aliases in the query. Any column or table comments from the original table are
+      not carried over to the new table.
+    </p>
+
+    <p rev="obwl" conref="../shared/impala_common.xml#common/insert_sort_blurb"/>
+
+    <p rev="1.2">
+      For example, the following statements show how you can clone all the data in a table, or a subset of the
+      columns and/or rows, or reorder columns, rename them, or construct them out of expressions:
+    </p>
+
+<codeblock rev="1.2">-- Create new table and copy all data.
+CREATE TABLE clone_of_t1 AS SELECT * FROM t1;
+-- Same idea as CREATE TABLE LIKE, don't copy any data.
+CREATE TABLE empty_clone_of_t1 AS SELECT * FROM t1 WHERE 1=0;
+-- Copy some data.
+CREATE TABLE subset_of_t1 AS SELECT * FROM t1 WHERE x > 100 AND y LIKE 'A%';
+CREATE TABLE summary_of_t1 AS SELECT c1, sum(c2) AS total, avg(c2) AS average FROM t1 GROUP BY c1;
+-- Switch file format.
+CREATE TABLE parquet_version_of_t1 STORED AS PARQUET AS SELECT * FROM t1;
+-- Create tables with different column order, names, or types than the original.
+CREATE TABLE some_columns_from_t1 AS SELECT c1, c3, c5 FROM t1;
+CREATE TABLE reordered_columns_from_t1 AS SELECT c4, c3, c1, c2 FROM t1;
+CREATE TABLE synthesized_columns AS SELECT upper(c1) AS all_caps, c2+c3 AS total, "California" AS state FROM t1;</codeblock>
+
+    <p rev="1.2">
+      As part of a CTAS operation, you can convert the data to any file format that Impala can write (currently,
+      <codeph>TEXTFILE</codeph> and <codeph>PARQUET</codeph>). You cannot specify the lower-level properties of a
+      text table, such as the delimiter. Although you can use a partitioned table as the source and copy data from
+      it, you cannot specify any partitioning clauses for the new table.
+    </p>
+
+    <p rev="1.4.0">
+      <b>CREATE TABLE LIKE PARQUET:</b>
+    </p>
+
+    <p rev="1.4.0">
+      The variation <codeph>CREATE TABLE ... LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'</codeph>
+      lets you skip the column definitions of the <codeph>CREATE TABLE</codeph> statement. The column names and
+      data types are automatically configured based on the organization of the specified Parquet data file, which
+      must already reside in HDFS. You can use a data file located outside the Impala database directories, or a
+      file from an existing Impala Parquet table; either way, Impala only uses the column definitions from the file
+      and does not use the HDFS location for the <codeph>LOCATION</codeph> attribute of the new table. (Although
+      you can also specify the enclosing directory with the <codeph>LOCATION</codeph> attribute, to both use the
+      same schema as the data file and point the Impala table at the associated directory for querying.)
+    </p>
+
+    <p rev="1.4.0">
+      The following considerations apply when you use the <codeph>CREATE TABLE LIKE PARQUET</codeph> technique:
+    </p>
+
+    <ul rev="1.4.0">
+      <li>
+        Any column comments from the original table are not preserved in the new table. Each column in the new
+        table has a comment stating the low-level Parquet field type used to deduce the appropriate SQL column
+        type.
+      </li>
+
+      <li>
+        If you use a data file from a partitioned Impala table, any partition key columns from the original table
+        are left out of the new table, because they are represented in HDFS directory names rather than stored in
+        the data file. To preserve the partition information, repeat the same <codeph>PARTITIONED BY</codeph>
+        clause as in the original <codeph>CREATE TABLE</codeph> statement.
+      </li>
+
+      <li>
+        The file format of the new table defaults to text, as with other kinds of <codeph>CREATE TABLE</codeph>
+        statements. To make the new table also use Parquet format, include the clause <codeph>STORED AS
+        PARQUET</codeph> in the <codeph>CREATE TABLE LIKE PARQUET</codeph> statement.
+      </li>
+
+      <li>
+        If the Parquet data file comes from an existing Impala table, currently, any <codeph>TINYINT</codeph> or
+        <codeph>SMALLINT</codeph> columns are turned into <codeph>INT</codeph> columns in the new table.
+        Internally, Parquet stores such values as 32-bit integers.
+      </li>
+
+      <li>
+        When the destination table uses the Parquet file format, the <codeph>CREATE TABLE AS SELECT</codeph> and
+        <codeph>INSERT ...
+        SELECT</codeph> statements always create at least one data file, even if the
+        <codeph>SELECT</codeph> part of the statement does not match any rows. You can use such an empty Parquet
+        data file as a template for subsequent <codeph>CREATE TABLE LIKE PARQUET</codeph> statements.
+      </li>
+    </ul>
+
+    <p>
+      For more details about creating Parquet tables, and examples of the <codeph>CREATE TABLE LIKE
+      PARQUET</codeph> syntax, see <xref href="impala_parquet.xml#parquet"/>.
+    </p>
+
+    <p>
+      <b>Visibility and Metadata (TBLPROPERTIES and WITH SERDEPROPERTIES clauses):</b>
+    </p>
+
+    <p rev="1.2">
+      You can associate arbitrary items of metadata with a table by specifying the <codeph>TBLPROPERTIES</codeph>
+      clause. This clause takes a comma-separated list of key-value pairs and stores those items in the metastore
+      database. You can also change the table properties later with an <codeph>ALTER TABLE</codeph> statement. You
+      can observe the table properties for different delimiter and escape characters using the <codeph>DESCRIBE
+      FORMATTED</codeph> command, and change those settings for an existing table with <codeph>ALTER TABLE ... SET
+      TBLPROPERTIES</codeph>.
+    </p>
+
+    <p rev="1.2">
+      You can also associate SerDes properties with the table by specifying key-value pairs through the
+      <codeph>WITH SERDEPROPERTIES</codeph> clause. This metadata is not used by Impala, which has its own built-in
+      serializer and deserializer for the file formats it supports. Particular property values might be needed for
+      Hive compatibility with certain variations of file formats, particularly Avro.
+    </p>
+
+    <p>
+      Some DDL operations that interact with other Hadoop components require specifying particular values in the
+      <codeph>SERDEPROPERTIES</codeph> or <codeph>TBLPROPERTIES</codeph> fields, such as creating an Avro table or
+      an HBase table. (You typically create HBase tables in Hive, because they require additional clauses not
+      currently available in Impala.)
+<!-- Haven't got a working example from Lenni, so suppressing this recommendation for now.
+     The Avro schema properties can be specified through either
+     <codeph>TBLPROPERTIES</codeph> or <codeph>SERDEPROPERTIES</codeph>;
+     for best compatibility with future versions of Hive,
+     use <codeph>SERDEPROPERTIES</codeph> in this case.
+-->
+    </p>
+
+    <p>
+      To see the column definitions and column comments for an existing table, for example before issuing a
+      <codeph>CREATE TABLE ... LIKE</codeph> or a <codeph>CREATE TABLE ... AS SELECT</codeph> statement, issue the
+      statement <codeph>DESCRIBE <varname>table_name</varname></codeph>. To see even more detail, such as the
+      location of data files and the values for clauses such as <codeph>ROW FORMAT</codeph> and <codeph>STORED
+      AS</codeph>, issue the statement <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph>.
+      <codeph>DESCRIBE FORMATTED</codeph> is also needed to see any overall table comment (as opposed to individual
+      column comments).
+    </p>
+
+    <p>
+      After creating a table, your <cmdname>impala-shell</cmdname> session or another
+      <cmdname>impala-shell</cmdname> connected to the same node can immediately query that table. There might be a
+      brief interval (one statestore heartbeat) before the table can be queried through a different Impala node. To
+      make the <codeph>CREATE TABLE</codeph> statement return only when the table is recognized by all Impala nodes
+      in the cluster, enable the <codeph>SYNC_DDL</codeph> query option.
+    </p>
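+
+    <p>
+      For example, the following sketch (with a hypothetical table name) enables that option in
+      <cmdname>impala-shell</cmdname> before issuing the DDL statement:
+    </p>
+
+<codeblock>SET SYNC_DDL=1;
+CREATE TABLE coordinated_table (id BIGINT, s STRING);
+</codeblock>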
+
+    <p rev="1.4.0">
+      <b>HDFS caching (CACHED IN clause):</b>
+    </p>
+
+    <p rev="1.4.0">
+      If you specify the <codeph>CACHED IN</codeph> clause, any existing or future data files in the table
+      directory or the partition subdirectories are designated to be loaded into memory with the HDFS caching
+      mechanism. See <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/> for details about using the HDFS
+      caching feature.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>
+
+<!-- Say something in here about the SHOW statement, e.g. SHOW TABLES, SHOW TABLE/COLUMN STATS, SHOW PARTITIONS. -->
+
+    <p>
+      <b>Column order</b>:
+    </p>
+
+    <p>
+      If you intend to use the table to hold data files produced by some external source, specify the columns in
+      the same order as they appear in the data files.
+    </p>
+
+    <p>
+      If you intend to insert or copy data into the table through Impala, or if you have control over the way
+      externally produced data files are arranged, use your judgment to specify columns in the most convenient
+      order:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          If certain columns are often <codeph>NULL</codeph>, specify those columns last. If a data file omits
+          these trailing columns entirely, Impala automatically fills in <codeph>NULL</codeph> values for them.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          If an unpartitioned table will be used as the source for an <codeph>INSERT ... SELECT</codeph> operation
+          into a partitioned table, specify last in the unpartitioned table any columns that correspond to
+          partition key columns in the partitioned table, and in the same order as the partition key columns are
+          declared in the partitioned table. This technique lets you use <codeph>INSERT ... SELECT *</codeph> when
+          copying data to the partitioned table, rather than specifying each column name individually.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          If you specify columns in an order that you later discover is suboptimal, you can sometimes work around
+          the problem without recreating the table. You can create a view that selects columns from the original
+          table in a permuted order, then do a <codeph>SELECT *</codeph> from the view. When inserting data into a
+          table, you can specify a permuted order for the inserted columns to match the order in the destination
+          table.
+        </p>
+      </li>
+    </ul>
+
+    <p conref="../shared/impala_common.xml#common/hive_blurb"/>
+
+    <p>
+      Impala queries can make use of metadata about the table and columns, such as the number of rows in a table or
+      the number of different values in a column. Prior to Impala 1.2.2, you gathered this information by issuing
+      the <codeph>ANALYZE TABLE</codeph> statement in Hive, after creating the table and loading representative
+      data into it. In Impala 1.2.2 and higher, the <codeph>COMPUTE STATS</codeph> statement produces these
+      statistics within Impala, without needing to use Hive at all.
+    </p>
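+
+    <p>
+      For example, a typical sequence (with hypothetical names) after creating and loading a table looks like
+      the following:
+    </p>
+
+<codeblock>CREATE TABLE t3 (id BIGINT, name STRING);
+-- Load or insert representative data, then gather statistics:
+COMPUTE STATS t3;
+SHOW TABLE STATS t3;
+</codeblock>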
+
+    <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
+    <note>
+      <p>
+        The Impala <codeph>CREATE TABLE</codeph> statement cannot create an HBase table, because it currently does
+        not support the <codeph>STORED BY</codeph> clause needed for HBase tables. Create such tables in Hive, then
+        query them through Impala. For information on using Impala with HBase tables, see
+        <xref href="impala_hbase.xml#impala_hbase"/>.
+      </p>
+    </note>
+
+    <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+    <p rev="2.2.0">
+      To create a table where the data resides in the Amazon Simple Storage Service (S3),
+      specify a <codeph>LOCATION</codeph> attribute with an <codeph>s3a://</codeph> prefix pointing to the data files in S3.
+      You can use this special <codeph>LOCATION</codeph> syntax when creating an empty table,
+      but not as part of a <codeph>CREATE TABLE AS SELECT</codeph> statement.
+      See <xref href="impala_s3.xml#s3"/> for details.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/insert_sort_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/hdfs_blurb"/>
+
+    <p>
+      The <codeph>CREATE TABLE</codeph> statement for an internal table creates a directory in HDFS. The
+      <codeph>CREATE EXTERNAL TABLE</codeph> statement associates the table with an existing HDFS directory, and
+      does not create any new directory in HDFS. To locate the HDFS data directory for a table, issue a
+      <codeph>DESCRIBE FORMATTED <varname>table</varname></codeph> statement. To examine the contents of that HDFS
+      directory, use an OS command such as <codeph>hdfs dfs -ls hdfs://<varname>path</varname></codeph>, either
+      from the OS command line or through the <codeph>shell</codeph> or <codeph>!</codeph> commands in
+      <cmdname>impala-shell</cmdname>.
+    </p>
+
+    <p>
+      The <codeph>CREATE TABLE AS SELECT</codeph> syntax creates data files under the table data directory to hold
+      any data copied by the <codeph>INSERT</codeph> portion of the statement. (Even if no data is copied, Impala
+      might create one or more empty data files.)
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+    <p rev="CDH-19187">
+      <!-- TBD. -->
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/security_blurb"/>
+    <p conref="../shared/impala_common.xml#common/redaction_yes"/>
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_maybe"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_tables.xml#tables"/>,
+      <xref href="impala_alter_table.xml#alter_table"/>, <xref href="impala_drop_table.xml#drop_table"/>,
+      <xref href="impala_partitioning.xml#partitioning"/>, <xref href="impala_tables.xml#internal_tables"/>,
+      <xref href="impala_tables.xml#external_tables"/>, <xref href="impala_compute_stats.xml#compute_stats"/>,
+      <xref href="impala_sync_ddl.xml#sync_ddl"/>, <xref href="impala_show.xml#show_tables"/>,
+      <xref href="impala_show.xml#show_create_table"/>, <xref href="impala_describe.xml#describe"/>
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_create_view.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_create_view.xml b/docs/topics/impala_create_view.xml
new file mode 100644
index 0000000..2458279
--- /dev/null
+++ b/docs/topics/impala_create_view.xml
@@ -0,0 +1,136 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.1" id="create_view">
+
+  <title>CREATE VIEW Statement</title>
+  <titlealts><navtitle>CREATE VIEW</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="DDL"/>
+      <data name="Category" value="Tables"/>
+      <data name="Category" value="Schemas"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">CREATE VIEW statement</indexterm>
+      The <codeph>CREATE VIEW</codeph> statement lets you create a shorthand abbreviation for a more complicated
+      query. The base query can involve joins, expressions, reordered columns, column aliases, and other SQL
+      features that can make a query hard to understand or maintain.
+    </p>
+
+    <p>
+      Because a view is purely a logical construct (an alias for a query) with no physical data behind it,
+      <codeph>CREATE VIEW</codeph> only involves changes to metadata in the metastore database, not any data files
+      in HDFS.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>CREATE VIEW [IF NOT EXISTS] <varname>view_name</varname> [(<varname>column_list</varname>)]
+  AS <varname>select_statement</varname></codeblock>
+
+    <p conref="../shared/impala_common.xml#common/ddl_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <p>
+      The <codeph>CREATE VIEW</codeph> statement can be useful in scenarios such as the following:
+    </p>
+
+    <ul>
+      <li>
+        To turn even the most lengthy and complicated SQL query into a one-liner. You can issue simple queries
+        against the view from applications, scripts, or interactive queries in <cmdname>impala-shell</cmdname>.
+        For example:
+<codeblock>select * from <varname>view_name</varname>;
+select * from <varname>view_name</varname> order by c1 desc limit 10;</codeblock>
+        The more complicated and hard-to-read the original query, the more benefit there is to simplifying the
+        query using a view.
+      </li>
+
+      <li>
+        To hide the underlying table and column names, to minimize maintenance problems if those names change. In
+        that case, you re-create the view using the new names, and all queries that use the view rather than the
+        underlying tables keep running with no changes.
+      </li>
+
+      <li>
+        To experiment with optimization techniques and make the optimized queries available to all applications.
+        For example, if you find a combination of <codeph>WHERE</codeph> conditions, join order, join hints, and so
+        on that works the best for a class of queries, you can establish a view that incorporates the
+        best-performing techniques. Applications can then make relatively simple queries against the view, without
+        repeating the complicated and optimized logic over and over. If you later find a better way to optimize the
+        original query, when you re-create the view, all the applications immediately take advantage of the
+        optimized base query.
+      </li>
+
+      <li>
+        To simplify a whole class of related queries, especially complicated queries involving joins between
+        multiple tables, complicated expressions in the column list, and other SQL syntax that makes the query
+        difficult to understand and debug. For example, you might create a view that joins several tables, filters
+        using several <codeph>WHERE</codeph> conditions, and selects several columns from the result set.
+        Applications might issue queries against this view that only vary in their <codeph>LIMIT</codeph>,
+        <codeph>ORDER BY</codeph>, and similar simple clauses.
+      </li>
+    </ul>
+
+    <p>
+      For queries that require repeating complicated clauses over and over again, for example in the select list,
+      <codeph>ORDER BY</codeph>, and <codeph>GROUP BY</codeph> clauses, you can use the <codeph>WITH</codeph>
+      clause as an alternative to creating a view.
+    </p>
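+
+    <p>
+      For example, the following sketch (with hypothetical table and column names) expresses the same derived
+      result set once through a view and once inline through a <codeph>WITH</codeph> clause:
+    </p>
+
+<codeblock>CREATE VIEW top_sellers AS SELECT seller, sum(amount) AS total FROM sales GROUP BY seller;
+SELECT * FROM top_sellers ORDER BY total DESC LIMIT 10;
+
+WITH top_sellers AS (SELECT seller, sum(amount) AS total FROM sales GROUP BY seller)
+SELECT * FROM top_sellers ORDER BY total DESC LIMIT 10;
+</codeblock>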
+
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+    <p conref="../shared/impala_common.xml#common/complex_types_views"/>
+    <p conref="../shared/impala_common.xml#common/complex_types_views_caveat"/>
+
+    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/security_blurb"/>
+    <p conref="../shared/impala_common.xml#common/redaction_yes"/>
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<!-- TK: Elaborate on these, show queries and real output. -->
+
+<codeblock>-- Create a view that is exactly the same as the underlying table.
+create view v1 as select * from t1;
+
+-- Create a view that includes only certain columns from the underlying table.
+create view v2 as select c1, c3, c7 from t1;
+
+-- Create a view that filters the values from the underlying table.
+create view v3 as select distinct c1, c3, c7 from t1 where c1 is not null and c5 > 0;
+
+-- Create a view that reorders and renames columns from the underlying table.
+create view v4 as select c4 as last_name, c6 as address, c2 as birth_date from t1;
+
+-- Create a view that runs functions to convert or transform certain columns.
+create view v5 as select c1, cast(c3 as string) c3, concat(c4,c5) c5, trim(c6) c6, "Constant" c8 from t1;
+
+-- Create a view that hides the complexity of a join query.
+create view v6 as select t1.c1, t2.c2 from t1 join t2 on t1.id = t2.id;
+</codeblock>
+
+<!-- These examples show CREATE VIEW and corresponding DROP VIEW statements, with different combinations
+     of qualified and unqualified names. -->
+
+    <p conref="../shared/impala_common.xml#common/create_drop_view_examples"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_views.xml#views"/>, <xref href="impala_alter_view.xml#alter_view"/>,
+      <xref href="impala_drop_view.xml#drop_view"/>
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_databases.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_databases.xml b/docs/topics/impala_databases.xml
new file mode 100644
index 0000000..ad0511f
--- /dev/null
+++ b/docs/topics/impala_databases.xml
@@ -0,0 +1,65 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="databases">
+
+  <title>Overview of Impala Databases</title>
+  <titlealts><navtitle>Databases</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Databases"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Querying"/>
+      <data name="Category" value="Schemas"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      In Impala, a database is a logical container for a group of tables. Each database defines a separate
+      namespace. Within a database, you can refer to the tables inside it using their unqualified names. Different
+      databases can contain tables with identical names.
+    </p>
+
+    <p>
+      Creating a database is a lightweight operation. There are minimal database-specific properties to configure,
+      only <codeph>LOCATION</codeph> and <codeph>COMMENT</codeph>. There is no <codeph>ALTER DATABASE</codeph> statement.
+    </p>
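+
+    <p>
+      For example, the following sketch (with hypothetical names) creates a database with both of those
+      properties, then switches to it:
+    </p>
+
+<codeblock>CREATE DATABASE analytics COMMENT 'Tables for the analytics project'
+  LOCATION '/user/etl/analytics_db';
+USE analytics;
+</codeblock>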
+
+    <p>
+      Typically, you create a separate database for each project or application, to avoid naming conflicts between
+      tables and to make clear which tables are related to each other. The <codeph>USE</codeph> statement lets
+      you switch between databases. Unqualified references to tables, views, and functions refer to objects
+      within the current database. You can also refer to objects in other databases by using qualified names
+      of the form <codeph><varname>dbname</varname>.<varname>object_name</varname></codeph>.
+    </p>
+
+    <p>
+      Each database is physically represented by a directory in HDFS. When you do not specify a <codeph>LOCATION</codeph>
+      attribute, the directory is located in the Impala data directory with the associated tables managed by Impala.
+      When you do specify a <codeph>LOCATION</codeph> attribute, any read and write operations for tables in that
+      database are relative to the specified HDFS directory.
+    </p>
+
+    <p>
+      There is a special database, named <codeph>default</codeph>, where you begin when you connect to Impala.
+      Tables created in <codeph>default</codeph> are physically located one level higher in HDFS than all the
+      user-created databases.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/builtins_db"/>
+
+    <p>
+      <b>Related statements:</b>
+    </p>
+
+    <p>
+      <xref href="impala_create_database.xml#create_database"/>,
+      <xref href="impala_drop_database.xml#drop_database"/>, <xref href="impala_use.xml#use"/>,
+      <xref href="impala_show.xml#show_databases"/>
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_datatypes.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_datatypes.xml b/docs/topics/impala_datatypes.xml
new file mode 100644
index 0000000..e45867e
--- /dev/null
+++ b/docs/topics/impala_datatypes.xml
@@ -0,0 +1,43 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="datatypes">
+
+  <title>Data Types</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Data Types"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Schemas"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">data types</indexterm>
+      Impala supports a set of data types that you can use for table columns, expression values, and function
+      arguments and return values.
+    </p>
+
+    <note>
+      Currently, Impala supports only scalar types, not composite or nested types. Accessing a table containing any
+      columns with unsupported types causes an error.
+    </note>
+
+    <p outputclass="toc"/>
+
+    <p>
+      For the notation to write literals of each of these data types, see
+      <xref href="impala_literals.xml#literals"/>.
+    </p>
+
+    <p>
+      See <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"/> for differences between Impala and
+      Hive data types.
+    </p>
+  </conbody>
+</concept>