http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_create_function.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_create_function.xml b/docs/topics/impala_create_function.xml new file mode 100644 index 0000000..5b81f50 --- /dev/null +++ b/docs/topics/impala_create_function.xml @@ -0,0 +1,492 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2" id="create_function"> + + <title>CREATE FUNCTION Statement</title> + <titlealts audience="PDF"><navtitle>CREATE FUNCTION</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="DDL"/> + <data name="Category" value="Schemas"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="UDFs"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">CREATE FUNCTION statement</indexterm> + Creates a user-defined function (UDF), which you can use to implement custom logic during + <codeph>SELECT</codeph> or <codeph>INSERT</codeph> operations. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + The syntax is different depending on whether you create a scalar UDF, which is called once for each row and + implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by + multiple functions that compute intermediate results across sets of rows. + </p> + + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + In <keyword keyref="impala25_full"/> and higher, the syntax is also different for creating or dropping scalar Java-based UDFs. + The statements for Java UDFs use a new syntax, without any argument types or return type specified. Java-based UDFs + created using the new syntax persist across restarts of the Impala catalog server, and can be shared transparently + between Impala and Hive. 
+ </p> + + <p> + To create a persistent scalar C++ UDF with <codeph>CREATE FUNCTION</codeph>: + </p> + +<codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]]) + RETURNS <varname>return_type</varname> + LOCATION '<varname>hdfs_path_to_dot_so</varname>' + SYMBOL='<varname>symbol_name</varname>'</codeblock> + + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + To create a persistent Java UDF with <codeph>CREATE FUNCTION</codeph>: +<codeblock>CREATE FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname> + LOCATION '<varname>hdfs_path_to_jar</varname>' + SYMBOL='<varname>class_name</varname>'</codeblock> + </p> + +<!-- +Examples: +CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf'; +CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2'; +DROP FUNCTION foo; +DROP FUNCTION IF EXISTS bar; +--> + + <p> + To create a persistent UDA, which must be written in C++, issue a <codeph>CREATE AGGREGATE FUNCTION</codeph> statement: + </p> + +<codeblock>CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>function_name</varname>([<varname>arg_type</varname>[, <varname>arg_type</varname>...]]) + RETURNS <varname>return_type</varname> + LOCATION '<varname>hdfs_path</varname>' + [INIT_FN='<varname>function</varname>'] + UPDATE_FN='<varname>function</varname>' + MERGE_FN='<varname>function</varname>' + [PREPARE_FN='<varname>function</varname>'] + [CLOSE_FN='<varname>function</varname>'] + <ph rev="2.0.0">[SERIALIZE_FN='<varname>function</varname>']</ph> + [FINALIZE_FN='<varname>function</varname>'] + <ph rev="2.3.0 IMPALA-1829 CDH-30572">[INTERMEDIATE <varname>type_spec</varname>]</ph></codeblock> + + <p conref="../shared/impala_common.xml#common/ddl_blurb"/> + + <p> + <b>Varargs notation:</b> + </p> + + <note rev="CDH-39271 CDH-38572"> + <p rev="CDH-39271 CDH-38572"> + Variable-length argument lists are supported for C++ UDFs, but currently not for Java UDFs. + </p> + </note> + + <p> + If the underlying implementation of your function accepts a variable number of arguments: + </p> + + <ul> + <li> + The variable arguments must go last in the argument list. + </li> + + <li> + The variable arguments must all be of the same type. + </li> + + <li> + You must include at least one instance of the variable arguments in every function call invoked from SQL. + </li> + + <li> + You designate the variable portion of the argument list in the <codeph>CREATE FUNCTION</codeph> statement + by including <codeph>...</codeph> immediately after the type name of the first variable argument. For + example, to create a function that accepts an <codeph>INT</codeph> argument, followed by a + <codeph>BOOLEAN</codeph>, followed by one or more <codeph>STRING</codeph> arguments, your <codeph>CREATE + FUNCTION</codeph> statement would look like: +<codeblock>CREATE FUNCTION <varname>func_name</varname> (INT, BOOLEAN, STRING ...) + RETURNS <varname>type</varname> LOCATION '<varname>path</varname>' SYMBOL='<varname>entry_point</varname>'; +</codeblock> + </li> + </ul> + + <p rev="CDH-39271 CDH-38572"> + See <xref href="impala_udf.xml#udf_varargs"/> for how to code a C++ UDF to accept + variable-length argument lists. + </p> + + <p> + <b>Scalar and aggregate functions:</b> + </p> + + <p> + The simplest kind of user-defined function returns a single scalar value each time it is called, typically + once for each row in the result set.
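For example, a hypothetical scalar UDF <codeph>my_lower()</codeph> would be invoked just like a built-in function: +<codeblock>-- Assumes my_lower() is a scalar UDF, already created, that lowercases its STRING argument. +select my_lower(s) from t1;</codeblock>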
This general kind of function is what is usually meant by UDF. + User-defined aggregate functions (UDAs) are a specialized kind of UDF that produces a single value based on + the contents of multiple rows. You usually use UDAs in combination with a <codeph>GROUP BY</codeph> clause to + condense a large result set into a smaller one, or even a single row summarizing column values across an + entire table. + </p> + + <p> + You create UDAs by using the <codeph>CREATE AGGREGATE FUNCTION</codeph> syntax. The clauses + <codeph>INIT_FN</codeph>, <codeph>UPDATE_FN</codeph>, <codeph>MERGE_FN</codeph>, + <ph rev="2.0.0"><codeph>SERIALIZE_FN</codeph>,</ph> <codeph>FINALIZE_FN</codeph>, and + <codeph>INTERMEDIATE</codeph> only apply when you create a UDA rather than a scalar UDF. + </p> + + <p> + The <codeph>*_FN</codeph> clauses specify functions to call at different phases of function processing. + </p> + + <ul> + <li> + <b>Initialize:</b> The function you specify with the <codeph>INIT_FN</codeph> clause does any initial + setup, such as initializing member variables in internal data structures. This function is often a stub for + simple UDAs. You can omit this clause and a default (no-op) function will be used. + </li> + + <li> + <b>Update:</b> The function you specify with the <codeph>UPDATE_FN</codeph> clause is called once for each + row in the original result set, that is, before any <codeph>GROUP BY</codeph> clause is applied. A separate + instance of the function is called for each different value returned by the <codeph>GROUP BY</codeph> + clause. The final argument passed to this function is a pointer, to which you write an updated value based + on its original value and the value of the first argument. + </li> + + <li> + <b>Merge:</b> The function you specify with the <codeph>MERGE_FN</codeph> clause is called an arbitrary + number of times, to combine intermediate values produced by different nodes or different threads as Impala + reads and processes data files in parallel. The final argument passed to this function is a pointer, to + which you write an updated value based on its original value and the value of the first argument. + </li> + + <li rev="2.0.0"> + <b>Serialize:</b> The function you specify with the <codeph>SERIALIZE_FN</codeph> clause frees memory + allocated to intermediate results. It is required if any memory was allocated by the Allocate function in + the Init, Update, or Merge functions, or if the intermediate type contains any pointers. See + <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the + UDA code samples</xref> for details. + </li> + + <li> + <b>Finalize:</b> The function you specify with the <codeph>FINALIZE_FN</codeph> clause does any required + teardown for resources acquired by your UDF, such as freeing memory, closing file handles if you explicitly + opened any files, and so on. This function is often a stub for simple UDAs. You can omit this clause and a + default (no-op) function will be used. It is required in UDAs where the final return type is different from + the intermediate type, or if any memory was allocated by the Allocate function in the Init, Update, or + Merge functions. See + <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html">the + UDA code samples</xref> for details.
+ </li> + </ul> + + <p> + If you use a consistent naming convention for each of the underlying functions, Impala can automatically + determine the names based on the first such clause, so the others are optional. + </p> + + <p audience="Cloudera"> + The <codeph>INTERMEDIATE</codeph> clause specifies the data type of intermediate values passed from the + <q>update</q> phase to the <q>merge</q> phase, and from the <q>merge</q> phase to the <q>finalize</q> phase. + You can use any of the existing Impala data types, or the special notation + <codeph>CHAR(<varname>n</varname>)</codeph> to allocate a scratch area of <varname>n</varname> bytes for the + intermediate result. For example, if the different phases of your UDA pass strings to each other but in the + end the function returns a <codeph>BIGINT</codeph> value, you would specify <codeph>INTERMEDIATE + STRING</codeph>. Likewise, if the different phases of your UDA pass two separate <codeph>BIGINT</codeph> values + between them (8 bytes each), you would specify <codeph>INTERMEDIATE CHAR(16)</codeph> so that each function + could read from and write to a 16-byte buffer. + </p> + + <p> + For end-to-end examples of UDAs, see <xref href="impala_udf.xml#udfs"/>. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <ul> + <li> + You can write Impala UDFs in either C++ or Java. C++ UDFs are new to Impala, and are the recommended format + for high performance because they run as native code. Java-based UDFs are compatible between Impala and Hive, and are + most suited to reusing existing Hive UDFs. (Impala can run Java-based Hive UDFs but not Hive UDAs.) + </li> + + <li rev="2.5.0 IMPALA-1748 CDH-38369 IMPALA-2843 CDH-39148"> + <keyword keyref="impala25_full"/> introduces UDF improvements to persistence for both C++ and Java UDFs, + and better compatibility between Impala and Hive for Java UDFs. + See <xref href="impala_udf.xml#udfs"/> for details. + </li> + + <li> + The body of the UDF is represented by a <codeph>.so</codeph> or <codeph>.jar</codeph> file, which you store + in HDFS and the <codeph>CREATE FUNCTION</codeph> statement distributes to each Impala node. + </li> + + <li> + Impala calls the underlying code during SQL statement evaluation, as many times as needed to process all + the rows from the result set. All UDFs are assumed to be deterministic, that is, to always return the same + result when passed the same argument values. Impala might or might not skip some invocations of a UDF if + the result value is already known from a previous call. Therefore, do not rely on the UDF being called a + specific number of times, and do not return different result values based on some external factor such as + the current time, a random number function, or an external data source that could be updated while an + Impala query is in progress. + </li> + + <li> + The names of the function arguments in the UDF are not significant, only their number, positions, and data + types. + </li> + + <li> + You can overload the same function name by creating multiple versions of the function, each with a + different argument signature. For security reasons, you cannot make a UDF with the same name as any + built-in function. + </li> + + <li> + In the UDF code, you represent the function return result as a <codeph>struct</codeph>. This + <codeph>struct</codeph> contains two fields.
The first field is a <codeph>boolean</codeph> representing + whether the value is <codeph>NULL</codeph> or not. (When this field is <codeph>true</codeph>, the return + value is interpreted as <codeph>NULL</codeph>.) The second field is the same type as the specified function + return type, and holds the return value when the function returns something other than + <codeph>NULL</codeph>. + </li> + + <li> + In the UDF code, you represent the function arguments as an initial pointer to a UDF context structure, + followed by references to zero or more <codeph>struct</codeph>s, corresponding to each of the arguments. + Each <codeph>struct</codeph> has the same 2 fields as with the return value, a <codeph>boolean</codeph> + field representing whether the argument is <codeph>NULL</codeph>, and a field of the appropriate type + holding any non-<codeph>NULL</codeph> argument value. + </li> + + <li> + For sample code and build instructions for UDFs, + see <xref href="https://github.com/cloudera/impala/tree/master/be/src/udf_samples" scope="external" format="html">the sample UDFs in the Impala github repo</xref>. + </li> + + <li> + Because the file representing the body of the UDF is stored in HDFS, it is automatically available to all + the Impala nodes. You do not need to manually copy any UDF-related files between servers. + </li> + + <li> + Because Impala currently does not have any <codeph>ALTER FUNCTION</codeph> statement, if you need to rename + a function, move it to a different database, or change its signature or other properties, issue a + <codeph>DROP FUNCTION</codeph> statement for the original function followed by a <codeph>CREATE + FUNCTION</codeph> with the desired properties. + </li> + + <li> + Because each UDF is associated with a particular database, either issue a <codeph>USE</codeph> statement + before doing any <codeph>CREATE FUNCTION</codeph> statements, or specify the name of the function as + <codeph><varname>db_name</varname>.<varname>function_name</varname></codeph>. + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> + + <p> + Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types + (not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but + not Impala UDFs written in C++. + </p> + + <p conref="../shared/impala_common.xml#common/current_user_caveat"/> + + <p><b>Persistence:</b></p> + + <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + For additional examples of all kinds of user-defined functions, see <xref href="impala_udf.xml#udfs"/>. + </p> + + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + The following example shows how to take a Java jar file and make all the functions inside one of its classes + into UDFs under a single (overloaded) function name in Impala. Each <codeph>CREATE FUNCTION</codeph> or + <codeph>DROP FUNCTION</codeph> statement applies to all the overloaded Java functions with the same name. + This example uses the signatureless syntax for <codeph>CREATE FUNCTION</codeph> and <codeph>DROP FUNCTION</codeph>, + which is available in <keyword keyref="impala25_full"/> and higher. 
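With the signatureless syntax, both statements omit the argument types and return type entirely, along these lines (the jar path and class name here are hypothetical): +<codeblock>CREATE FUNCTION my_udf LOCATION '/path/to/udfs.jar' SYMBOL='com.example.MyUdfClass'; +DROP FUNCTION my_udf;</codeblock>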
+ </p> + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + At the start, the jar file is in the local filesystem. Then it is copied into HDFS, so that it is + available for Impala to reference through the <codeph>CREATE FUNCTION</codeph> statement and + queries that refer to the Impala function name. + </p> +<codeblock rev="2.5.0 IMPALA-2843 CDH-39148"> +$ jar -tvf udf-examples-cdh570.jar + 0 Mon Feb 22 04:06:50 PST 2016 META-INF/ + 122 Mon Feb 22 04:06:48 PST 2016 META-INF/MANIFEST.MF + 0 Mon Feb 22 04:06:46 PST 2016 com/ + 0 Mon Feb 22 04:06:46 PST 2016 com/cloudera/ + 0 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/ + 2460 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/IncompatibleUdfTest.class + 541 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/TestUdfException.class + 3438 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/JavaUdfTest.class + 5872 Mon Feb 22 04:06:46 PST 2016 com/cloudera/impala/TestUdf.class +... +$ hdfs dfs -put udf-examples-cdh570.jar /user/impala/udfs +$ hdfs dfs -ls /user/impala/udfs +Found 2 items +-rw-r--r-- 3 jrussell supergroup 853 2015-10-09 14:05 /user/impala/udfs/hello_world.jar +-rw-r--r-- 3 jrussell supergroup 7366 2016-06-08 14:25 /user/impala/udfs/udf-examples-cdh570.jar +</codeblock> + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + In <cmdname>impala-shell</cmdname>, the <codeph>CREATE FUNCTION</codeph> statement refers to the HDFS path of the jar file + and the fully qualified class name inside the jar. Each of the functions inside the class becomes an + Impala function, each one overloaded under the specified Impala function name. + </p> +<codeblock rev="2.5.0 IMPALA-2843 CDH-39148"> +[localhost:21000] > create function testudf location '/user/impala/udfs/udf-examples-cdh570.jar' symbol='com.cloudera.impala.TestUdf'; +[localhost:21000] > show functions; ++-------------+---------------------------------------+-------------+---------------+ +| return type | signature | binary type | is persistent | ++-------------+---------------------------------------+-------------+---------------+ +| BIGINT | testudf(BIGINT) | JAVA | true | +| BOOLEAN | testudf(BOOLEAN) | JAVA | true | +| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | +| BOOLEAN | testudf(BOOLEAN, BOOLEAN, BOOLEAN) | JAVA | true | +| DOUBLE | testudf(DOUBLE) | JAVA | true | +| DOUBLE | testudf(DOUBLE, DOUBLE) | JAVA | true | +| DOUBLE | testudf(DOUBLE, DOUBLE, DOUBLE) | JAVA | true | +| FLOAT | testudf(FLOAT) | JAVA | true | +| FLOAT | testudf(FLOAT, FLOAT) | JAVA | true | +| FLOAT | testudf(FLOAT, FLOAT, FLOAT) | JAVA | true | +| INT | testudf(INT) | JAVA | true | +| DOUBLE | testudf(INT, DOUBLE) | JAVA | true | +| INT | testudf(INT, INT) | JAVA | true | +| INT | testudf(INT, INT, INT) | JAVA | true | +| SMALLINT | testudf(SMALLINT) | JAVA | true | +| SMALLINT | testudf(SMALLINT, SMALLINT) | JAVA | true | +| SMALLINT | testudf(SMALLINT, SMALLINT, SMALLINT) | JAVA | true | +| STRING | testudf(STRING) | JAVA | true | +| STRING | testudf(STRING, STRING) | JAVA | true | +| STRING | testudf(STRING, STRING, STRING) | JAVA | true | +| TINYINT | testudf(TINYINT) | JAVA | true | ++-------------+---------------------------------------+-------------+---------------+ +</codeblock> + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + These are all simple functions that either return their single argument, or + combine multiple arguments by summing them, concatenating them, and so on. Impala determines which + overloaded function to use based on the number and types of the arguments.
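The queries below assume single-column test tables, which might be created and populated along these lines: +<codeblock>-- Hypothetical test tables for the following queries; only the string_x +-- contents are populated here, matching the query output shown below. +create table bigint_x (x bigint); +create table int_x (x int); +create table string_x (x string); +insert into string_x values ('one'), ('two'), ('four'), ('three');</codeblock>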
+ </p> +<codeblock rev="2.5.0 IMPALA-2843 CDH-39148"> +insert into bigint_x values (1), (2), (4), (3); +select testudf(x) from bigint_x; ++-----------------+ +| udfs.testudf(x) | ++-----------------+ +| 1 | +| 2 | +| 4 | +| 3 | ++-----------------+ + +insert into int_x values (1), (2), (4), (3); +select testudf(x, x+1, x*x) from int_x; ++-------------------------------+ +| udfs.testudf(x, x + 1, x * x) | ++-------------------------------+ +| 4 | +| 9 | +| 25 | +| 16 | ++-------------------------------+ + +select testudf(x) from string_x; ++-----------------+ +| udfs.testudf(x) | ++-----------------+ +| one | +| two | +| four | +| three | ++-----------------+ +select testudf(x,x) from string_x; ++--------------------+ +| udfs.testudf(x, x) | ++--------------------+ +| oneone | +| twotwo | +| fourfour | +| threethree | ++--------------------+ +</codeblock> + + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + The previous example used the same Impala function name as the name of the class. + This example shows how the Impala function name is independent of the underlying + Java class or function names. A second <codeph>CREATE FUNCTION</codeph> statement + results in a set of overloaded functions all named <codeph>my_func</codeph>, + to go along with the overloaded functions all named <codeph>testudf</codeph>. + </p> +<codeblock rev="2.5.0 IMPALA-2843 CDH-39148"> +create function my_func location '/user/impala/udfs/udf-examples-cdh570.jar' + symbol='com.cloudera.impala.TestUdf'; + +show functions; ++-------------+---------------------------------------+-------------+---------------+ +| return type | signature | binary type | is persistent | ++-------------+---------------------------------------+-------------+---------------+ +| BIGINT | my_func(BIGINT) | JAVA | true | +| BOOLEAN | my_func(BOOLEAN) | JAVA | true | +| BOOLEAN | my_func(BOOLEAN, BOOLEAN) | JAVA | true | +... +| BIGINT | testudf(BIGINT) | JAVA | true | +| BOOLEAN | testudf(BOOLEAN) | JAVA | true | +| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | +... +</codeblock> + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + The corresponding <codeph>DROP FUNCTION</codeph> statement with no signature + drops all the overloaded functions with that name. + </p> +<codeblock rev="2.5.0 IMPALA-2843 CDH-39148"> +drop function my_func; +show functions; ++-------------+---------------------------------------+-------------+---------------+ +| return type | signature | binary type | is persistent | ++-------------+---------------------------------------+-------------+---------------+ +| BIGINT | testudf(BIGINT) | JAVA | true | +| BOOLEAN | testudf(BOOLEAN) | JAVA | true | +| BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | +... +</codeblock> + <p rev="2.5.0 IMPALA-2843 CDH-39148"> + The signatureless <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs ensures that + the functions shown in this example remain available after the Impala service + (specifically, the Catalog Server) is restarted. + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_udf.xml#udfs"/> for more background information, usage instructions, and examples for + Impala UDFs; <xref href="impala_drop_function.xml#drop_function"/> + </p> + </conbody> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_create_role.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_create_role.xml b/docs/topics/impala_create_role.xml new file mode 100644 index 0000000..2308490 --- /dev/null +++ b/docs/topics/impala_create_role.xml @@ -0,0 +1,70 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.4.0" id="create_role"> + + <title>CREATE ROLE Statement (<keyword keyref="impala20"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>CREATE ROLE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="DDL"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Sentry"/> + <data name="Category" value="Security"/> + <data name="Category" value="Roles"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. --> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">CREATE ROLE statement</indexterm> +<!-- Copied from Sentry docs. Turn into conref. --> + The <codeph>CREATE ROLE</codeph> statement creates a role to which privileges can be granted. Privileges can + be granted to roles, which can then be assigned to users. A user that has been assigned a role will only be + able to exercise the privileges of that role. Only users that have administrative privileges can create/drop + roles. By default, the <codeph>hive</codeph>, <codeph>impala</codeph> and <codeph>hue</codeph> users have + administrative privileges in Sentry. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>CREATE ROLE <varname>role_name</varname> +</codeblock> + + <p conref="../shared/impala_common.xml#common/privileges_blurb"/> + + <p> + Only administrative users (those with <codeph>ALL</codeph> privileges on the server, defined in the Sentry + policy file) can use this statement. + </p> + + <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> + + <p> + Impala makes use of any roles and privileges specified by the <codeph>GRANT</codeph> and + <codeph>REVOKE</codeph> statements in Hive, and Hive makes use of any roles and privileges specified by the + <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala. The Impala <codeph>GRANT</codeph> + and <codeph>REVOKE</codeph> statements for privileges do not require the <codeph>ROLE</codeph> keyword to be + repeated before each role name, unlike the equivalent Hive statements. + </p> + +<!-- To do: nail down the new SHOW syntax, e.g. SHOW ROLES, SHOW CURRENT ROLES, SHOW GROUPS. 
--> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/>, + <xref href="impala_revoke.xml#revoke"/>, <xref href="impala_drop_role.xml#drop_role"/>, + <xref href="impala_show.xml#show"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_create_table.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_create_table.xml b/docs/topics/impala_create_table.xml new file mode 100644 index 0000000..ba83fc5 --- /dev/null +++ b/docs/topics/impala_create_table.xml @@ -0,0 +1,832 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="create_table" outputclass="impala sql_statement"> + + <title outputclass="impala_title sql_statement_title">CREATE TABLE Statement</title> + <titlealts audience="PDF"><navtitle>CREATE TABLE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="DDL"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="HDFS Caching"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Schemas"/> + <data name="Category" value="S3"/> + <!-- <data name="Category" value="Kudu"/> --> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">CREATE TABLE statement</indexterm> + Creates a new table and specifies its characteristics. While creating a table, you optionally specify aspects + such as: + </p> + + <ul> + <li> + Whether the table is internal or external. + </li> + + <li> + The columns and associated data types. + </li> + + <li> + The columns used for physically partitioning the data. + </li> + + <li> + The file format for data files. + </li> + + <li> + The HDFS directory where the data files are located. + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + The general syntax for creating a table and specifying its columns is as follows: + </p> + + <p> + <b>Explicit column definitions:</b> + </p> + +<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname> + (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...) 
+ [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)] + [COMMENT '<varname>table_comment</varname>'] + [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] + [ + [ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>] + ] + [LOCATION '<varname>hdfs_path</varname>'] + [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] +<ph rev="1.4.0"> [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED] +</codeblock> + + <p> + <b>Column definitions inferred from data file:</b> + </p> + +<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname> + LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>' + [COMMENT '<varname>table_comment</varname>'] + [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)] + [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] + [ + [ROW FORMAT <varname>row_format</varname>] [STORED AS <varname>file_format</varname>] + ] + [LOCATION '<varname>hdfs_path</varname>'] + [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] +<ph rev="1.4.0"> [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED] +data_type: + <varname>primitive_type</varname> + | array_type + | map_type + | struct_type +</codeblock> + + <p> + <b>CREATE TABLE AS SELECT:</b> + </p> + +<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname> + <ph rev="2.5.0">[PARTITIONED BY (<varname>col_name</varname>[, ...])]</ph> + [COMMENT '<varname>table_comment</varname>'] + [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] + [ + [ROW FORMAT <varname>row_format</varname>] <ph rev="CDH-41501">[STORED AS <varname>ctas_file_format</varname>]</ph> + ] + [LOCATION '<varname>hdfs_path</varname>'] + [TBLPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] +<ph rev="1.4.0"> [CACHED IN '<varname>pool_name</varname>'</ph> <ph rev="2.2.0">[WITH REPLICATION = <varname>integer</varname>]</ph> | UNCACHED] +AS + <varname>select_statement</varname></codeblock> + +<codeblock>primitive_type: + TINYINT + | SMALLINT + | INT + | BIGINT + | BOOLEAN + | FLOAT + | DOUBLE + <ph rev="1.4.0">| DECIMAL</ph> + | STRING + <ph rev="2.0.0">| CHAR</ph> + <ph rev="2.0.0">| VARCHAR</ph> + | TIMESTAMP + +<ph rev="2.3.0">complex_type: + struct_type + | array_type + | map_type + +struct_type: STRUCT < <varname>name</varname> : <varname>primitive_or_complex_type</varname> [COMMENT '<varname>comment_string</varname>'], ...
> + +array_type: ARRAY < <varname>primitive_or_complex_type</varname> > + +map_type: MAP < <varname>primitive_type</varname>, <varname>primitive_or_complex_type</varname> > +</ph> +row_format: + DELIMITED [FIELDS TERMINATED BY '<varname>char</varname>' [ESCAPED BY '<varname>char</varname>']] + [LINES TERMINATED BY '<varname>char</varname>'] + +file_format: + PARQUET + | TEXTFILE + | AVRO + | SEQUENCEFILE + | RCFILE + +<ph rev="CDH-41501">ctas_file_format: + PARQUET + | TEXTFILE</ph> +</codeblock> + + <p conref="../shared/impala_common.xml#common/ddl_blurb"/> + + <!-- Should really have some info up front about all the data types and file formats. + Consider adding here, or at least making inline links to the relevant keywords + in the syntax spec above. --> + + <p> + <b>Column definitions:</b> + </p> + + <p> + Depending on the form of the <codeph>CREATE TABLE</codeph> statement, the column definitions are + required or not allowed. + </p> + + <p> + With the <codeph>CREATE TABLE AS SELECT</codeph> and <codeph>CREATE TABLE LIKE</codeph> + syntax, you do not specify the columns at all; the column names and types are derived from the source table, query, + or data file. + </p> + + <p> + With the basic <codeph>CREATE TABLE</codeph> syntax, you must list one or more columns, specifying for each + column its name, type, and optionally a comment, in addition to any columns used as partitioning keys. + There is one exception where the column list is not required: when creating an Avro table with the + <codeph>STORED AS AVRO</codeph> clause, you can omit the list of columns and specify the same metadata + as part of the <codeph>TBLPROPERTIES</codeph> clause. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p rev="2.3.0"> + The Impala complex types (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>) + are available in <keyword keyref="impala23_full"/> and higher. + Because you can nest these types (for example, to make an array of maps or a struct + with an array field), these types are also sometimes referred to as nested types. + See <xref href="impala_complex_types.xml#complex_types"/> for usage details. + </p> + + <!-- This is kind of an obscure and rare usage scenario. Consider moving all the complex type stuff further down + after some of the more common clauses. --> + <p rev="2.3.0"> + Impala can create tables containing complex type columns, with any supported file format. + Because currently Impala can only query complex type columns in Parquet tables, creating + tables with complex type columns and other file formats such as text is of limited use. + For example, you might create a text table including some columns with complex types with Impala, and use Hive + as part of your ETL pipeline to ingest the nested type data and copy it to an identical Parquet table. + Or you might create a partitioned table containing complex type columns using one file format, and + use <codeph>ALTER TABLE</codeph> to change the file format of individual partitions to Parquet; Impala + can then query only the Parquet-format partitions in that table. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_partitioning"/> + + <p> + <b>Internal and external tables (EXTERNAL and LOCATION clauses):</b> + </p> + + <p> + By default, Impala creates an <q>internal</q> table, where Impala manages the underlying data files for the + table, and physically deletes the data files when you drop the table.
If you specify the + <codeph>EXTERNAL</codeph> clause, Impala treats the table as an <q>external</q> table, where the data files + are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves + the data files in place when you drop the table. For details about internal and external tables, see + <xref href="impala_tables.xml#tables"/>. + </p> + + <p> + Typically, for an external table you include a <codeph>LOCATION</codeph> clause to specify the path to the + HDFS directory where Impala reads and writes files for the table. For example, if your data pipeline produces + Parquet files in the HDFS directory <filepath>/user/etl/destination</filepath>, you might create an external + table as follows: + </p> + +<codeblock>CREATE EXTERNAL TABLE external_parquet (c1 INT, c2 STRING, c3 TIMESTAMP) + STORED AS PARQUET LOCATION '/user/etl/destination'; +</codeblock> + + <p> + Although the <codeph>EXTERNAL</codeph> and <codeph>LOCATION</codeph> clauses are often specified together, + <codeph>LOCATION</codeph> is optional for external tables, and you can also specify <codeph>LOCATION</codeph> + for internal tables. The difference is all about whether Impala <q>takes control</q> of the underlying data + files and moves them when you rename the table, or deletes them when you drop the table. For more about + internal and external tables and how they interact with the <codeph>LOCATION</codeph> attribute, see + <xref href="impala_tables.xml#tables"/>. + </p> + + <p> + <b>Partitioned tables (PARTITIONED BY clause):</b> + </p> + + <p> + The <codeph>PARTITIONED BY</codeph> clause divides the data files based on the values from one or more + specified columns. Impala queries can use the partition metadata to minimize the amount of data that is read + from disk or transmitted across the network, particularly during join queries. For details about + partitioning, see <xref href="impala_partitioning.xml#partitioning"/>. + </p> + + <p rev="2.5.0"> + Prior to <keyword keyref="impala25_full"/> you could use a partitioned table + as the source and copy data from it, but could not specify any partitioning clauses for the new table. + In <keyword keyref="impala25_full"/> and higher, you can now use the <codeph>PARTITIONED BY</codeph> clause with a + <codeph>CREATE TABLE AS SELECT</codeph> statement. See the examples under the following discussion of + the <codeph>CREATE TABLE AS SELECT</codeph> syntax variation. + </p> + +<!-- + <p rev="kudu"> + <b>Partitioning for Kudu tables (DISTRIBUTE BY clause)</b> + </p> + + <p rev="kudu"> + For Kudu tables, you specify logical partitioning across one or more columns using the + <codeph>DISTRIBUTE BY</codeph> clause. In contrast to partitioning for HDFS-based tables, + multiple values for a partition key column can be located in the same partition. + The optional <codeph>HASH</codeph> clause lets you divide one or a set of partition key columns + into a specified number of buckets; you can use more than one <codeph>HASH</codeph> + clause, specifying a distinct set of partition key columns for each. + The optional <codeph>RANGE</codeph> clause further subdivides the partitions, based on + a set of literal values for the partition key columns. + </p> +--> + + <p> + <b>Specifying file format (STORED AS and ROW FORMAT clauses):</b> + </p> + + <p rev="DOCS-1523"> + The <codeph>STORED AS</codeph> clause identifies the format of the underlying data files. 
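For example, the following hypothetical table stores its data files in Parquet format: +<codeblock>CREATE TABLE parquet_table (id BIGINT, s STRING) STORED AS PARQUET;</codeblock>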
Currently, Impala + can query more types of file formats than it can create or insert into. Use Hive to perform any create or + data load operations that are not currently available in Impala. For example, Impala can create an Avro, + SequenceFile, or RCFile table but cannot insert data into it. There are also Impala-specific procedures for using + compression with each kind of file format. For details about working with data files of various formats, see + <xref href="impala_file_formats.xml#file_formats"/>. + </p> + + <note> + In Impala 1.4.0 and higher, Impala can create Avro tables, which formerly required doing the <codeph>CREATE + TABLE</codeph> statement in Hive. See <xref href="impala_avro.xml#avro"/> for details and examples. + </note> + + <p> + By default (when no <codeph>STORED AS</codeph> clause is specified), data files in Impala tables are created + as text files with Ctrl-A (hex 01) characters as the delimiter. +<!-- Verify if ROW FORMAT is entirely ignored outside of text tables, or does it apply somehow to SequenceFile and/or RCFile too? --> + Specify the <codeph>ROW FORMAT DELIMITED</codeph> clause to produce or ingest data files that use a different + delimiter character such as tab or <codeph>|</codeph>, or a different line end character such as carriage + return or newline. When specifying delimiter and line end characters with the <codeph>FIELDS TERMINATED + BY</codeph> and <codeph>LINES TERMINATED BY</codeph> clauses, use <codeph>'\t'</codeph> for tab, + <codeph>'\n'</codeph> for newline or linefeed, <codeph>'\r'</codeph> for carriage return, and + <codeph>\</codeph><codeph>0</codeph> for ASCII <codeph>nul</codeph> (hex 00). For more examples of text + tables, see <xref href="impala_txtfile.xml#txtfile"/>. + </p> + + <p> + The <codeph>ESCAPED BY</codeph> clause applies both to text files that you create through an + <codeph>INSERT</codeph> statement to an Impala <codeph>TEXTFILE</codeph> table, and to existing data files + that you put into an Impala table directory. (You can ingest existing data files either by creating the table + with <codeph>CREATE EXTERNAL TABLE ... LOCATION</codeph>, the <codeph>LOAD DATA</codeph> statement, or + through an HDFS operation such as <codeph>hdfs dfs -put <varname>file</varname> + <varname>hdfs_path</varname></codeph>.) Choose an escape character that is not used anywhere else in the + file, and put it in front of each instance of the delimiter character that occurs within a field value. + Surrounding field values with quotation marks does not help Impala to parse fields with embedded delimiter + characters; the quotation marks are considered to be part of the column value. If you want to use + <codeph>\</codeph> as the escape character, specify the clause in <cmdname>impala-shell</cmdname> as + <codeph>ESCAPED BY '\\'</codeph>. + </p> + + <note conref="../shared/impala_common.xml#common/thorn"/> + + <p> + <b>Cloning tables (LIKE clause):</b> + </p> + + <p> + To create an empty table with the same columns, comments, and other attributes as another table, use the + following variation. The <codeph>CREATE TABLE ... LIKE</codeph> form allows a restricted set of clauses, + currently only the <codeph>LOCATION</codeph>, <codeph>COMMENT</codeph>, and <codeph>STORED AS</codeph> + clauses. 
+ </p> + +<codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname> + <ph rev="1.4.0">LIKE { [<varname>db_name</varname>.]<varname>table_name</varname> | PARQUET '<varname>hdfs_path_of_parquet_file</varname>' }</ph> + [COMMENT '<varname>table_comment</varname>'] + [STORED AS <varname>file_format</varname>] + [LOCATION '<varname>hdfs_path</varname>']</codeblock> + + <note rev="1.2.0"> + <p rev="1.2.0"> + To clone the structure of a table and transfer data into it in a single operation, use the <codeph>CREATE + TABLE AS SELECT</codeph> syntax described in the next subsection. + </p> + </note> + + <p> + When you clone the structure of an existing table using the <codeph>CREATE TABLE ... LIKE</codeph> syntax, + the new table keeps the same file format as the original one, so you only need to specify the <codeph>STORED + AS</codeph> clause if you want to use a different file format, or when specifying a view as the original + table. (Creating a table <q>like</q> a view produces a text table by default.) + </p> + + <p> + Although normally Impala cannot create an HBase table directly, Impala can clone the structure of an existing + HBase table with the <codeph>CREATE TABLE ... LIKE</codeph> syntax, preserving the file format and metadata + from the original table. + </p> + + <p> + There are some exceptions to the ability to use <codeph>CREATE TABLE ... LIKE</codeph> with an Avro table. + For example, you cannot use this technique for an Avro table that is specified with an Avro schema but no + columns. When in doubt, check if a <codeph>CREATE TABLE ... LIKE</codeph> operation works in Hive; if not, it + typically will not work in Impala either. + </p> + + <p> + If the original table is partitioned, the new table inherits the same partition key columns. Because the new + table is initially empty, it does not inherit the actual partitions that exist in the original one. To create + partitions in the new table, insert data or issue <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements. + </p> + + <p conref="../shared/impala_common.xml#common/create_table_like_view"/> + + <p> + Because <codeph>CREATE TABLE ... LIKE</codeph> only manipulates table metadata, not the physical data of the + table, issue <codeph>INSERT INTO TABLE</codeph> statements afterward to copy any data from the original table + into the new one, optionally converting the data to a new file format. (For some file formats, Impala can do + a <codeph>CREATE TABLE ... LIKE</codeph> to create the table, but Impala cannot insert data in that file + format; in these cases, you must load the data in Hive. See + <xref href="impala_file_formats.xml#file_formats"/> for details.) + </p> + + <p rev="1.2" id="ctas"> + <b>CREATE TABLE AS SELECT:</b> + </p> + + <p> + The <codeph>CREATE TABLE AS SELECT</codeph> syntax is a shorthand notation to create a table based on column + definitions from another table, and copy data from the source table to the destination table without issuing + any separate <codeph>INSERT</codeph> statement. This idiom is so popular that it has its own acronym, + <q>CTAS</q>. + </p> + + <p> + The following examples show how to copy data from a source table <codeph>T1</codeph> + to a variety of destination tables, applying various transformations to the table + properties, table layout, or the data itself as part of the operation: + </p> + +<codeblock> +-- Sample table to be the source of CTAS operations.
+CREATE TABLE t1 (x INT, y STRING); +INSERT INTO t1 VALUES (1, 'one'), (2, 'two'), (3, 'three'); + +-- Clone all the columns and data from one table to another. +CREATE TABLE clone_of_t1 AS SELECT * FROM t1; ++-------------------+ +| summary | ++-------------------+ +| Inserted 3 row(s) | ++-------------------+ + +-- Clone the columns and data, and convert the data to a different file format. +CREATE TABLE parquet_version_of_t1 STORED AS PARQUET AS SELECT * FROM t1; ++-------------------+ +| summary | ++-------------------+ +| Inserted 3 row(s) | ++-------------------+ + +-- Copy only some rows to the new table. +CREATE TABLE subset_of_t1 AS SELECT * FROM t1 WHERE x >= 2; ++-------------------+ +| summary | ++-------------------+ +| Inserted 2 row(s) | ++-------------------+ + +-- Same idea as CREATE TABLE LIKE: clone table layout but do not copy any data. +CREATE TABLE empty_clone_of_t1 AS SELECT * FROM t1 WHERE 1=0; ++-------------------+ +| summary | ++-------------------+ +| Inserted 0 row(s) | ++-------------------+ + +-- Reorder and rename columns and transform the data. +CREATE TABLE t5 AS SELECT upper(y) AS s, x+1 AS a, 'Entirely new column' AS n FROM t1; ++-------------------+ +| summary | ++-------------------+ +| Inserted 3 row(s) | ++-------------------+ +SELECT * FROM t5; ++-------+---+---------------------+ +| s | a | n | ++-------+---+---------------------+ +| ONE | 2 | Entirely new column | +| TWO | 3 | Entirely new column | +| THREE | 4 | Entirely new column | ++-------+---+---------------------+ +</codeblock> + +<!-- These are a little heavyweight to get into here. Therefore commenting out. + Some overlap with the new column-changing examples in the code listing above. +Create tables with different column order, names, or types than the original. +CREATE TABLE some_columns_from_t1 AS SELECT c1, c3, c5 FROM t1; +CREATE TABLE reordered_columns_from_t1 AS SELECT c4, c3, c1, c2 FROM t1; +CREATE TABLE synthesized_columns AS SELECT upper(c1) AS all_caps, c2+c3 AS total, "California" AS state FROM t1;</codeblock> +--> + +<!-- CREATE TABLE AS <select> now incorporated up higher in the original syntax diagram. --> + + <p rev="1.2"> + See <xref href="impala_select.xml#select"/> for details about query syntax for the <codeph>SELECT</codeph> + portion of a <codeph>CREATE TABLE AS SELECT</codeph> statement. + </p> + + <p rev="1.2"> + The newly created table inherits the column names that you select from the original table, which you can + override by specifying column aliases in the query. Any column or table comments from the original table are + not carried over to the new table. + </p> + + <note rev="DOCS-1523"> + When using the <codeph>STORED AS</codeph> clause with a <codeph>CREATE TABLE AS SELECT</codeph> + statement, the destination table must be a file format that Impala can write to: currently, + text or Parquet. You cannot specify an Avro, SequenceFile, or RCFile table as the destination + table for a CTAS operation. + </note> + + <p rev="2.5.0"> + Prior to <keyword keyref="impala25_full"/> you could use a partitioned table + as the source and copy data from it, but could not specify any partitioning clauses for the new table. + In <keyword keyref="impala25_full"/> and higher, you can now use the <codeph>PARTITIONED BY</codeph> clause with a + <codeph>CREATE TABLE AS SELECT</codeph> statement. 
The following example demonstrates how you can copy + data from an unpartitioned table in a <codeph>CREATE TABLE AS SELECT</codeph> operation, creating a new + partitioned table in the process. The main syntax consideration is the column order in the <codeph>PARTITIONED BY</codeph> + clause and the select list: the partition key columns must be listed last in the select list, in the same + order as in the <codeph>PARTITIONED BY</codeph> clause. Therefore, in this case, the column order in the + destination table is different from the source table. You also only specify the column names in the + <codeph>PARTITIONED BY</codeph> clause, not the data types or column comments. + </p> + +<codeblock rev="2.5.0"> +create table partitions_no (year smallint, month tinyint, s string); +insert into partitions_no values (2016, 1, 'January 2016'), + (2016, 2, 'February 2016'), (2016, 3, 'March 2016'); + +-- Prove that the source table is not partitioned. +show partitions partitions_no; +ERROR: AnalysisException: Table is not partitioned: ctas_partition_by.partitions_no + +-- Create new table with partitions based on column values from source table. +<b>create table partitions_yes partitioned by (year, month) + as select s, year, month from partitions_no;</b> ++-------------------+ +| summary | ++-------------------+ +| Inserted 3 row(s) | ++-------------------+ + +-- Prove that the destination table is partitioned. +show partitions partitions_yes; ++-------+-------+-------+--------+------+... +| year | month | #Rows | #Files | Size |... ++-------+-------+-------+--------+------+... +| 2016 | 1 | -1 | 1 | 13B |... +| 2016 | 2 | -1 | 1 | 14B |... +| 2016 | 3 | -1 | 1 | 11B |... +| Total | | -1 | 3 | 38B |... ++-------+-------+-------+--------+------+... +</codeblock> + + <p rev="2.5.0"> + The most convenient layout for partitioned tables is with all the + partition key columns at the end. The CTAS <codeph>PARTITIONED BY</codeph> syntax + requires that column order in the select list, resulting in that same + column order in the destination table. + </p> + +<codeblock rev="2.5.0"> +describe partitions_no; ++-------+----------+---------+ +| name | type | comment | ++-------+----------+---------+ +| year | smallint | | +| month | tinyint | | +| s | string | | ++-------+----------+---------+ + +-- The CTAS operation forced us to put the partition key columns last. +-- Having those columns last works better with idioms such as SELECT * +-- for partitioned tables. +describe partitions_yes; ++-------+----------+---------+ +| name | type | comment | ++-------+----------+---------+ +| s | string | | +| year | smallint | | +| month | tinyint | | ++-------+----------+---------+ +</codeblock> + + <p rev="2.5.0"> + Attempting to use a select list with the partition key columns + not at the end results in an error due to a column name mismatch: + </p> + +<codeblock rev="2.5.0"> +-- We expect this CTAS to fail because non-key column S +-- comes after key columns YEAR and MONTH in the select list. 
+create table partitions_maybe partitioned by (year, month) + as select year, month, s from partitions_no; +ERROR: AnalysisException: Partition column name mismatch: year != month +</codeblock> + + <p rev="1.2"> + For example, the following statements show how you can clone all the data in a table, or a subset of the + columns and/or rows, or reorder columns, rename them, or construct them out of expressions: + </p> + + <p rev="1.2"> + As part of a CTAS operation, you can convert the data to any file format that Impala can write (currently, + <codeph>TEXTFILE</codeph> and <codeph>PARQUET</codeph>). You cannot specify the lower-level properties of a + text table, such as the delimiter. + </p> + + <p rev="obwl" conref="../shared/impala_common.xml#common/insert_sort_blurb"/> + + <p rev="1.4.0"> + <b>CREATE TABLE LIKE PARQUET:</b> + </p> + + <p rev="1.4.0"> + The variation <codeph>CREATE TABLE ... LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'</codeph> + lets you skip the column definitions of the <codeph>CREATE TABLE</codeph> statement. The column names and + data types are automatically configured based on the organization of the specified Parquet data file, which + must already reside in HDFS. You can use a data file located outside the Impala database directories, or a + file from an existing Impala Parquet table; either way, Impala only uses the column definitions from the file + and does not use the HDFS location for the <codeph>LOCATION</codeph> attribute of the new table. (Although + you can also specify the enclosing directory with the <codeph>LOCATION</codeph> attribute, to both use the + same schema as the data file and point the Impala table at the associated directory for querying.) + </p> + + <p rev="1.4.0"> + The following considerations apply when you use the <codeph>CREATE TABLE LIKE PARQUET</codeph> technique: + </p> + + <ul rev="1.4.0"> + <li> + Any column comments from the original table are not preserved in the new table. Each column in the new + table has a comment stating the low-level Parquet field type used to deduce the appropriate SQL column + type. + </li> + + <li> + If you use a data file from a partitioned Impala table, any partition key columns from the original table + are left out of the new table, because they are represented in HDFS directory names rather than stored in + the data file. To preserve the partition information, repeat the same <codeph>PARTITION</codeph> clause as + in the original <codeph>CREATE TABLE</codeph> statement. + </li> + + <li> + The file format of the new table defaults to text, as with other kinds of <codeph>CREATE TABLE</codeph> + statements. To make the new table also use Parquet format, include the clause <codeph>STORED AS + PARQUET</codeph> in the <codeph>CREATE TABLE LIKE PARQUET</codeph> statement. + </li> + + <li> + If the Parquet data file comes from an existing Impala table, currently, any <codeph>TINYINT</codeph> or + <codeph>SMALLINT</codeph> columns are turned into <codeph>INT</codeph> columns in the new table. + Internally, Parquet stores such values as 32-bit integers. + </li> + + <li> + When the destination table uses the Parquet file format, the <codeph>CREATE TABLE AS SELECT</codeph> and + <codeph>INSERT ... SELECT</codeph> statements always create at least one data file, even if the + <codeph>SELECT</codeph> part of the statement does not match any rows. You can use such an empty Parquet + data file as a template for subsequent <codeph>CREATE TABLE LIKE PARQUET</codeph> statements. 
+ </li> + </ul> + + <p> + For more details about creating Parquet tables, and examples of the <codeph>CREATE TABLE LIKE + PARQUET</codeph> syntax, see <xref href="impala_parquet.xml#parquet"/>. + </p> + + <p> + <b>Visibility and Metadata (TBLPROPERTIES and WITH SERDEPROPERTIES clauses):</b> + </p> + + <p rev="1.2"> + You can associate arbitrary items of metadata with a table by specifying the <codeph>TBLPROPERTIES</codeph> + clause. This clause takes a comma-separated list of key-value pairs and stores those items in the metastore + database. You can also change the table properties later with an <codeph>ALTER TABLE</codeph> statement. You + can observe the table properties for different delimiter and escape characters using the <codeph>DESCRIBE + FORMATTED</codeph> command, and change those settings for an existing table with <codeph>ALTER TABLE ... SET + TBLPROPERTIES</codeph>. + </p> + + <p rev="1.2"> + You can also associate SerDes properties with the table by specifying key-value pairs through the + <codeph>WITH SERDEPROPERTIES</codeph> clause. This metadata is not used by Impala, which has its own built-in + serializer and deserializer for the file formats it supports. Particular property values might be needed for + Hive compatibility with certain variations of file formats, particularly Avro. + </p> + + <p> + Some DDL operations that interact with other Hadoop components require specifying particular values in the + <codeph>SERDEPROPERTIES</codeph> or <codeph>TBLPROPERTIES</codeph> fields, such as creating an Avro table or + an HBase table. (You typically create HBase tables in Hive, because they require additional clauses not + currently available in Impala.) +<!-- Haven't got a working example from Lenni, so suppressing this recommendation for now. + The Avro schema properties can be specified through either + <codeph>TBLPROPERTIES</codeph> or <codeph>SERDEPROPERTIES</codeph>; + for best compatibility with future versions of Hive, + use <codeph>SERDEPROPERTIES</codeph> in this case. +--> + </p> + + <p> + To see the column definitions and column comments for an existing table, for example before issuing a + <codeph>CREATE TABLE ... LIKE</codeph> or a <codeph>CREATE TABLE ... AS SELECT</codeph> statement, issue the + statement <codeph>DESCRIBE <varname>table_name</varname></codeph>. To see even more detail, such as the + location of data files and the values for clauses such as <codeph>ROW FORMAT</codeph> and <codeph>STORED + AS</codeph>, issue the statement <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph>. + <codeph>DESCRIBE FORMATTED</codeph> is also needed to see any overall table comment (as opposed to individual + column comments). + </p> + + <p> + After creating a table, your <cmdname>impala-shell</cmdname> session or another + <cmdname>impala-shell</cmdname> connected to the same node can immediately query that table. There might be a + brief interval (one statestore heartbeat) before the table can be queried through a different Impala node. To + make the <codeph>CREATE TABLE</codeph> statement return only when the table is recognized by all Impala nodes + in the cluster, enable the <codeph>SYNC_DDL</codeph> query option. + </p> + + <p rev="1.4.0"> + <b>HDFS caching (CACHED IN clause):</b> + </p> + + <p rev="1.4.0"> + If you specify the <codeph>CACHED IN</codeph> clause, any existing or future data files in the table + directory or the partition subdirectories are designated to be loaded into memory with the HDFS caching + mechanism. 
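For example, the following statement creates a cached table, assuming a cache pool named <codeph>testPool</codeph> has already been set up: +<codeblock>CREATE TABLE cached_logs (id BIGINT, msg STRING) CACHED IN 'testPool';</codeblock>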
+    <p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>
+
+<!-- Say something in here about the SHOW statement, e.g. SHOW TABLES, SHOW TABLE/COLUMN STATS, SHOW PARTITIONS. -->
+
+    <p>
+      <b>Column order</b>:
+    </p>
+
+    <p>
+      If you intend to use the table to hold data files produced by some external source, specify the columns in
+      the same order as they appear in the data files.
+    </p>
+
+    <p>
+      If you intend to insert or copy data into the table through Impala, or if you have control over the way
+      externally produced data files are arranged, use your judgment to specify columns in the most convenient
+      order:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          If certain columns are often <codeph>NULL</codeph>, specify those columns last. You might produce data
+          files that omit these trailing columns entirely; in that case, Impala automatically fills in
+          <codeph>NULL</codeph> for the missing values.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          If an unpartitioned table will be used as the source for an <codeph>INSERT ... SELECT</codeph> operation
+          into a partitioned table, specify last in the unpartitioned table any columns that correspond to
+          partition key columns in the partitioned table, and in the same order as the partition key columns are
+          declared in the partitioned table. This technique lets you use <codeph>INSERT ... SELECT *</codeph> when
+          copying data to the partitioned table, rather than specifying each column name individually.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          If you specify columns in an order that you later discover is suboptimal, you can sometimes work around
+          the problem without recreating the table. You can create a view that selects columns from the original
+          table in a permuted order, then do a <codeph>SELECT *</codeph> from the view. When inserting data into a
+          table, you can specify a permuted order for the inserted columns to match the order in the destination
+          table.
+        </p>
+      </li>
+    </ul>
+
+    <p conref="../shared/impala_common.xml#common/hive_blurb"/>
+
+    <p>
+      Impala queries can make use of metadata about the table and columns, such as the number of rows in a table or
+      the number of different values in a column. Prior to Impala 1.2.2, you gathered this metadata by issuing the
+      <codeph>ANALYZE TABLE</codeph> statement in Hive, after creating the table and loading representative data
+      into it. In Impala 1.2.2 and higher, the <codeph>COMPUTE STATS</codeph> statement produces these statistics
+      within Impala, without needing to use Hive at all.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
+    <note>
+      <p>
+        The Impala <codeph>CREATE TABLE</codeph> statement cannot create an HBase table, because it currently does
+        not support the <codeph>STORED BY</codeph> clause needed for HBase tables. Create such tables in Hive, then
+        query them through Impala. For information on using Impala with HBase tables, see
+        <xref href="impala_hbase.xml#impala_hbase"/>.
+      </p>
+    </note>
+
+    <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+    <p rev="2.2.0">
+      To create a table where the data resides in the Amazon Simple Storage Service (S3),
+      specify a <codeph>LOCATION</codeph> attribute with an <codeph>s3a://</codeph> prefix
+      pointing to the data files in S3.
+    </p>
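+    <p>
+      For example, this sketch points a table at data files in S3; the bucket name and path are
+      hypothetical:
+    </p>
+
+<codeblock>create external table s3_t (id bigint, s string)
+  location 's3a://bucket_name/path/to/data/';
+</codeblock>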
+    <p rev="2.6.0 CDH-39913 IMPALA-1878">
+      In <keyword keyref="impala26_full"/> and higher, you can
+      use this special <codeph>LOCATION</codeph> syntax
+      as part of a <codeph>CREATE TABLE AS SELECT</codeph> statement.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/s3_ddl"/>
+
+    <p conref="../shared/impala_common.xml#common/insert_sort_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/hdfs_blurb"/>
+
+    <p>
+      The <codeph>CREATE TABLE</codeph> statement for an internal table creates a directory in HDFS. The
+      <codeph>CREATE EXTERNAL TABLE</codeph> statement associates the table with an existing HDFS directory, and
+      does not create any new directory in HDFS. To locate the HDFS data directory for a table, issue a
+      <codeph>DESCRIBE FORMATTED <varname>table</varname></codeph> statement. To examine the contents of that HDFS
+      directory, use an OS command such as <codeph>hdfs dfs -ls hdfs://<varname>path</varname></codeph>, either
+      from the OS command line or through the <codeph>shell</codeph> or <codeph>!</codeph> commands in
+      <cmdname>impala-shell</cmdname>.
+    </p>
+
+    <p>
+      The <codeph>CREATE TABLE AS SELECT</codeph> syntax creates data files under the table data directory to hold
+      any data copied by the <codeph>INSERT</codeph> portion of the statement. (Even if no data is copied, Impala
+      might create one or more empty data files.)
+    </p>
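+    <p>
+      For example, this sketch locates the data directory for a table and lists its contents from
+      inside <cmdname>impala-shell</cmdname>, as described above. The table name and the path shown
+      in the comment are hypothetical:
+    </p>
+
+<codeblock>describe formatted t1;
+-- The 'Location:' field of the output shows the HDFS directory, for example
+-- hdfs://nn_host/user/hive/warehouse/db1.db/t1. List it with the ! command:
+!hdfs dfs -ls /user/hive/warehouse/db1.db/t1;
+</codeblock>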
Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">CREATE VIEW statement</indexterm> + The <codeph>CREATE VIEW</codeph> statement lets you create a shorthand abbreviation for a more complicated + query. The base query can involve joins, expressions, reordered columns, column aliases, and other SQL + features that can make a query hard to understand or maintain. + </p> + + <p> + Because a view is purely a logical construct (an alias for a query) with no physical data behind it, + <codeph>ALTER VIEW</codeph> only involves changes to metadata in the metastore database, not any data files + in HDFS. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>CREATE VIEW [IF NOT EXISTS] <varname>view_name</varname> [(<varname>column_list</varname>)] + AS <varname>select_statement</varname></codeblock> + + <p conref="../shared/impala_common.xml#common/ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + The <codeph>CREATE VIEW</codeph> statement can be useful in scenarios such as the following: + </p> + + <ul> + <li> + To turn even the most lengthy and complicated SQL query into a one-liner. You can issue simple queries + against the view from applications, scripts, or interactive queries in <cmdname>impala-shell</cmdname>. + For example: +<codeblock>select * from <varname>view_name</varname>; +select * from <varname>view_name</varname> order by c1 desc limit 10;</codeblock> + The more complicated and hard-to-read the original query, the more benefit there is to simplifying the + query using a view. + </li> + + <li> + To hide the underlying table and column names, to minimize maintenance problems if those names change. In + that case, you re-create the view using the new names, and all queries that use the view rather than the + underlying tables keep running with no changes. + </li> + + <li> + To experiment with optimization techniques and make the optimized queries available to all applications. + For example, if you find a combination of <codeph>WHERE</codeph> conditions, join order, join hints, and so + on that works the best for a class of queries, you can establish a view that incorporates the + best-performing techniques. Applications can then make relatively simple queries against the view, without + repeating the complicated and optimized logic over and over. If you later find a better way to optimize the + original query, when you re-create the view, all the applications immediately take advantage of the + optimized base query. + </li> + + <li> + To simplify a whole class of related queries, especially complicated queries involving joins between + multiple tables, complicated expressions in the column list, and other SQL syntax that makes the query + difficult to understand and debug. For example, you might create a view that joins several tables, filters + using several <codeph>WHERE</codeph> conditions, and selects several columns from the result set. + Applications might issue queries against this view that only vary in their <codeph>LIMIT</codeph>, + <codeph>ORDER BY</codeph>, and similar simple clauses. + </li> + </ul> + + <p> + For queries that require repeating complicated clauses over and over again, for example in the select list, + <codeph>ORDER BY</codeph>, and <codeph>GROUP BY</codeph> clauses, you can use the <codeph>WITH</codeph> + clause as an alternative to creating a view. 
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+    <p conref="../shared/impala_common.xml#common/complex_types_views"/>
+    <p conref="../shared/impala_common.xml#common/complex_types_views_caveat"/>
+
+    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/security_blurb"/>
+    <p conref="../shared/impala_common.xml#common/redaction_yes"/>
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<!-- TK: Elaborate on these, show queries and real output. -->
+
+<codeblock>-- Create a view that is exactly the same as the underlying table.
+create view v1 as select * from t1;
+
+-- Create a view that includes only certain columns from the underlying table.
+create view v2 as select c1, c3, c7 from t1;
+
+-- Create a view that filters the values from the underlying table.
+create view v3 as select distinct c1, c3, c7 from t1 where c1 is not null and c5 > 0;
+
+-- Create a view that reorders and renames columns from the underlying table.
+create view v4 as select c4 as last_name, c6 as address, c2 as birth_date from t1;
+
+-- Create a view that runs functions to convert or transform certain columns.
+create view v5 as select c1, cast(c3 as string) c3, concat(c4,c5) c5, trim(c6) c6, "Constant" c8 from t1;
+
+-- Create a view that hides the complexity of a join query.
+create view v6 as select t1.c1, t2.c2 from t1 join t2 on t1.id = t2.id;
+</codeblock>
+
+<!-- These examples show CREATE VIEW and corresponding DROP VIEW statements, with different combinations
+     of qualified and unqualified names. -->
+
+    <p conref="../shared/impala_common.xml#common/create_drop_view_examples"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_views.xml#views"/>, <xref href="impala_alter_view.xml#alter_view"/>,
+      <xref href="impala_drop_view.xml#drop_view"/>
+    </p>
+  </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_data_sources.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_data_sources.xml b/docs/topics/impala_data_sources.xml
new file mode 100644
index 0000000..8fb46e1
--- /dev/null
+++ b/docs/topics/impala_data_sources.xml
@@ -0,0 +1,22 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.4.0" id="data_sources">
+
+  <title>Data Sources</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <xref href="impala_create_data_source.xml#create_data_source"/>,
+      <xref href="impala_drop_data_source.xml#drop_data_source"/>,
+      <xref href="impala_create_table.xml#create_table"/>
+    </p>
+  </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_databases.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_databases.xml b/docs/topics/impala_databases.xml
new file mode 100644
index 0000000..3253e75
--- /dev/null
+++ b/docs/topics/impala_databases.xml
@@ -0,0 +1,65 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
"concept.dtd"> +<concept id="databases"> + + <title>Overview of Impala Databases</title> + <titlealts audience="PDF"><navtitle>Databases</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Databases"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + In Impala, a database is a logical container for a group of tables. Each database defines a separate + namespace. Within a database, you can refer to the tables inside it using their unqualified names. Different + databases can contain tables with identical names. + </p> + + <p> + Creating a database is a lightweight operation. There are minimal database-specific properties to configure, + only <codeph>LOCATION</codeph> and <codeph>COMMENT</codeph>. There is no <codeph>ALTER DATABASE</codeph> statement. + </p> + + <p> + Typically, you create a separate database for each project or application, to avoid naming conflicts between + tables and to make clear which tables are related to each other. The <codeph>USE</codeph> statement lets + you switch between databases. Unqualified references to tables, views, and functions refer to objects + within the current database. You can also refer to objects in other databases by using qualified names + of the form <codeph><varname>dbname</varname>.<varname>object_name</varname></codeph>. + </p> + + <p> + Each database is physically represented by a directory in HDFS. When you do not specify a <codeph>LOCATION</codeph> + attribute, the directory is located in the Impala data directory with the associated tables managed by Impala. + When you do specify a <codeph>LOCATION</codeph> attribute, any read and write operations for tables in that + database are relative to the specified HDFS directory. + </p> + + <p> + There is a special database, named <codeph>default</codeph>, where you begin when you connect to Impala. + Tables created in <codeph>default</codeph> are physically located one level higher in HDFS than all the + user-created databases. 
+    <p conref="../shared/impala_common.xml#common/builtins_db"/>
+
+    <p>
+      <b>Related statements:</b>
+    </p>
+
+    <p>
+      <xref href="impala_create_database.xml#create_database"/>,
+      <xref href="impala_drop_database.xml#drop_database"/>, <xref href="impala_use.xml#use"/>,
+      <xref href="impala_show.xml#show_databases"/>
+    </p>
+  </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_datatypes.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_datatypes.xml b/docs/topics/impala_datatypes.xml
new file mode 100644
index 0000000..e45867e
--- /dev/null
+++ b/docs/topics/impala_datatypes.xml
@@ -0,0 +1,43 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="datatypes">
+
+  <title>Data Types</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Data Types"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Schemas"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">data types</indexterm>
+      Impala supports a set of data types that you can use for table columns, expression values, and function
+      arguments and return values.
+    </p>
+
+    <note>
+      Currently, Impala supports only scalar types, not composite or nested types. Accessing a table containing any
+      columns with unsupported types causes an error.
+    </note>
+
+    <p outputclass="toc"/>
+
+    <p>
+      For the notation to write literals of each of these data types, see
+      <xref href="impala_literals.xml#literals"/>.
+    </p>
+
+    <p>
+      See <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"/> for differences between Impala and
+      Hive data types.
+    </p>
+  </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_date.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_date.xml b/docs/topics/impala_date.xml
new file mode 100644
index 0000000..ae17b9a
--- /dev/null
+++ b/docs/topics/impala_date.xml
@@ -0,0 +1,104 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept audience="Cloudera" id="date" rev="2.0.0">
+
+  <title>DATE Data Type (<keyword keyref="impala21"/> or higher only)</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Data Types"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Dates and Times"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">DATE data type</indexterm>
+      A type representing the date (year, month, and day) as a single numeric value. Used to represent a broader
+      date range than possible with the <codeph>TIMESTAMP</codeph> type, with fewer distinct values than
+      <codeph>TIMESTAMP</codeph>, and in a more compact and efficient form than using a <codeph>STRING</codeph>
+      such as <codeph>'2014-12-31'</codeph>.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock><varname>column_name</varname> DATE</codeblock>
+
+    <p>
+      <b>Range:</b> January 1, 4712 BC .. December 31, 9999 AD.
+    </p>
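+    <p>
+      For example, this sketch declares a <codeph>DATE</codeph> column and uses another as a
+      partition key column; the table and column names are hypothetical, and assume a build where
+      the type is enabled:
+    </p>
+
+<codeblock>create table events (id bigint, register_date date)
+  partitioned by (event_date date);
+</codeblock>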
+    <p conref="../shared/impala_common.xml#common/hbase_ok"/>
+
+    <p conref="../shared/impala_common.xml#common/parquet_blurb"/>
+
+    <ul>
+      <li>
+        This type can be read from and written to Parquet files.
+      </li>
+
+      <li>
+        There is no requirement for a particular version of Parquet.
+      </li>
+
+      <li>
+        Parquet files generated by Impala and containing this type can be freely interchanged with other components
+        such as Hive and MapReduce.
+      </li>
+    </ul>
+
+    <p conref="../shared/impala_common.xml#common/hive_blurb"/>
+
+    <p>
+      TK.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/conversion_blurb"/>
+
+    <p>
+      TK.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
+
+    <p>
+      This type can be used for partition key columns. Because it has less granularity (and thus fewer distinct
+      values) than an equivalent <codeph>TIMESTAMP</codeph> column, and numeric columns are more efficient as
+      partition keys than strings, prefer to partition by a <codeph>DATE</codeph> column rather than a
+      <codeph>TIMESTAMP</codeph> column or a <codeph>STRING</codeph> representation of a date.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
+
+    <p>
+      This type is available on CDH 5.2 or higher.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/internals_2_bytes"/>
+
+    <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+    <p conref="../shared/impala_common.xml#common/column_stats_constant"/>
+
+    <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+    <p>
+      The behavior when converting <codeph>TIMESTAMP</codeph> values to <codeph>DATE</codeph>, or
+      <codeph>DATE</codeph> values to <codeph>TIMESTAMP</codeph>, is TK.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      The <xref href="impala_timestamp.xml#timestamp">TIMESTAMP</xref> data type is closely related. Some functions
+      from <xref href="impala_datetime_functions.xml#datetime_functions"/> accept and return <codeph>DATE</codeph>
+      values.
+    </p>
+  </conbody>
+</concept>
