http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_ldap.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_ldap.xml b/docs/topics/impala_ldap.xml new file mode 100644 index 0000000..f2ef523 --- /dev/null +++ b/docs/topics/impala_ldap.xml @@ -0,0 +1,354 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="ldap"> + + <title>Enabling LDAP Authentication for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Security"/> + <data name="Category" value="LDAP"/> + <data name="Category" value="Authentication"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="Starting and Stopping"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> + + <conbody> + +<!-- Similar discussion under 'Authentication' parent topic. Maybe do some conref'ing or linking upward. --> + + <p> Authentication is the process of allowing only specified named users to + access the server (in this case, the Impala server). This feature is + crucial for any production deployment, to prevent misuse, tampering, or + excessive load on the server. Impala uses LDAP for authentication, + verifying the credentials of each user who connects through + <cmdname>impala-shell</cmdname>, Hue, a Business Intelligence tool, JDBC + or ODBC application, and so on. </p> + + <note conref="../shared/impala_common.xml#common/authentication_vs_authorization"/> + + <p outputclass="toc inpage"/> + + <p> + An alternative form of authentication you can use is Kerberos, described in + <xref href="impala_kerberos.xml#kerberos"/>. + </p> + </conbody> + + <concept id="ldap_prereqs"> + + <title>Requirements for Using Impala with LDAP</title> + <prolog> + <metadata> + <data name="Category" value="Requirements"/> + <data name="Category" value="Planning"/> + </metadata> + </prolog> + + <conbody> + + <p rev="1.4.0"> + Authentication against LDAP servers is available in Impala 1.2.2 and higher. Impala 1.4.0 adds support for + secure LDAP authentication through SSL and TLS. + </p> + + <p> + The Impala LDAP support lets you use Impala with systems such as Active Directory that use LDAP behind the + scenes. + </p> + </conbody> + </concept> + + <concept id="ldap_client_server"> + + <title>Client-Server Considerations for LDAP</title> + + <conbody> + + <p> + Only client->Impala connections can be authenticated by LDAP. + </p> + + <p> You must use the Kerberos authentication mechanism for connections + between internal Impala components, such as between the + <cmdname>impalad</cmdname>, <cmdname>statestored</cmdname>, and + <cmdname>catalogd</cmdname> daemons. See <xref + href="impala_kerberos.xml#kerberos" /> on how to set up Kerberos for + Impala. </p> + </conbody> + </concept> + + <concept id="ldap_config"> + + <title>Server-Side LDAP Setup</title> + + <conbody> + + <p> + These requirements apply on the server side when configuring and starting Impala: + </p> + + <p> + To enable LDAP authentication, set the following startup options for <cmdname>impalad</cmdname>: + </p> + + <ul> + <li> + <codeph>--enable_ldap_auth</codeph> enables LDAP-based authentication between the client and Impala. + </li> + + <li rev="1.4.0"> + <codeph>--ldap_uri</codeph> sets the URI of the LDAP server to use. Typically, the URI is prefixed with + <codeph>ldap://</codeph>. 
In Impala 1.4.0 and higher, you can specify secure SSL-based LDAP transport by + using the prefix <codeph>ldaps://</codeph>. The URI can optionally specify the port, for example: + <codeph>ldap://ldap_server.cloudera.com:389</codeph> or + <codeph>ldaps://ldap_server.cloudera.com:636</codeph>. (389 and 636 are the default ports for non-SSL and + SSL LDAP connections, respectively.) + </li> + +<!-- Some amount of this bullet could be conref'ed. Similar but not identical bullet occurs later under TLS. --> + + <li rev="1.4.0"> + For <codeph>ldaps://</codeph> connections secured by SSL, + <codeph>--ldap_ca_certificate="<varname>/path/to/certificate/pem</varname>"</codeph> specifies the + location of the certificate in standard <codeph>.PEM</codeph> format. Store this certificate on the local + filesystem, in a location that only the <codeph>impala</codeph> user and other trusted users can read. + </li> + +<!-- Per Henry: not for public consumption. +<li> + If you need to provide a custom SASL configuration, + set <codeph>- -ldap_manual_config</codeph> to bypass all the automatic configuration. +</li> +--> + </ul> + </conbody> + </concept> + + <concept id="ldap_bind_strings"> + + <title>Support for Custom Bind Strings</title> + + <conbody> + + <p> + When Impala connects to LDAP it issues a bind call to the LDAP server to authenticate as the connected + user. Impala clients, including the Impala shell, provide the short name of the user to Impala. This is + necessary so that Impala can use Sentry for role-based access, which uses short names. + </p> + + <p> + However, LDAP servers often require more complex, structured usernames for authentication. Impala supports + three ways of transforming the short name (for example, <codeph>'henry'</codeph>) to a more complicated + string. If necessary, specify one of the following configuration options when starting the + <cmdname>impalad</cmdname> daemon on each DataNode: + </p> + + <ul> + <li> + <codeph>--ldap_domain</codeph>: Replaces the username with a string + <codeph><varname>username</varname>@<varname>ldap_domain</varname></codeph>. + </li> + + <li> + <codeph>--ldap_baseDN</codeph>: Replaces the username with a <q>distinguished name</q> (DN) of the form: + <codeph>uid=<varname>userid</varname>,ldap_baseDN</codeph>. (This is equivalent to a Hive option). + </li> + + <li> + <codeph>--ldap_bind_pattern</codeph>: This is the most general option, and replaces the username with the + string <varname>ldap_bind_pattern</varname> where all instances of the string <codeph>#UID</codeph> are + replaced with <varname>userid</varname>. For example, an <codeph>ldap_bind_pattern</codeph> of + <codeph>"user=#UID,OU=foo,CN=bar"</codeph> with a username of <codeph>henry</codeph> will construct a + bind name of <codeph>"user=henry,OU=foo,CN=bar"</codeph>. + </li> + </ul> + + <p rev="CDH-26854"> + For clusters not managed by Cloudera Manager, + specify the option on the <cmdname>impalad</cmdname> command line. + For clusters managed by Cloudera Manager 5.4.0 and higher, + search for the configuration field names <codeph>ldap_domain</codeph>, + <codeph>ldap_basedn</codeph>, or <codeph>ldap_bind_pattern</codeph>, + fill in and save the appropriate field values, and restart the Impala service. + Prior to Cloudera Manager 5.4.0, these values were filled in using the + <uicontrol>Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve)</uicontrol> + field. 
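 </p>
+
+ <p>
+ For example, the following command line sketches how the bind pattern option might look on a
+ cluster not managed by Cloudera Manager. The LDAP URI and the DN components are hypothetical
+ placeholders; substitute the values for your own directory service:
+ </p>
+
+<codeblock># Hypothetical example values; substitute your own LDAP URI and DN components.
+impalad --enable_ldap_auth=true \
+    --ldap_uri=ldap://ldap_server.example.com:389 \
+    --ldap_bind_pattern="uid=#UID,ou=people,dc=example,dc=com" \
+    <varname>other_startup_options</varname></codeblock>
+
+ <p>
+ With this pattern, a user who connects as <codeph>henry</codeph> is authenticated with the
+ bind name <codeph>uid=henry,ou=people,dc=example,dc=com</codeph>.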
+ </p>
+
+ <p>
+ These options are mutually exclusive; Impala does not start if more than one of these options is specified.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="ldap_security">
+
+ <title>Secure LDAP Connections</title>
+
+ <conbody>
+
+ <p>
+ To avoid sending credentials over the wire in cleartext, you must configure a secure connection between
+ both the client and Impala, and between Impala and the LDAP server. The secure connection could use SSL or
+ TLS.
+ </p>
+
+ <p>
+ <b>Secure LDAP connections through SSL:</b>
+ </p>
+
+ <p>
+ For SSL-enabled LDAP connections, specify a prefix of <codeph>ldaps://</codeph> instead of
+ <codeph>ldap://</codeph>. Also, the default port for SSL-enabled LDAP connections is 636 instead of 389.
+ </p>
+
+ <p rev="1.4.0">
+ <b>Secure LDAP connections through TLS:</b>
+ </p>
+
+ <p>
+ <xref href="http://en.wikipedia.org/wiki/Transport_Layer_Security" scope="external" format="html">TLS</xref>,
+ the successor to the SSL protocol, is supported by most modern LDAP servers. Unlike SSL connections, TLS
+ connections can be made on the same server port as non-TLS connections. To secure all connections using
+ TLS, specify the following flags as startup options to the <cmdname>impalad</cmdname> daemon:
+ </p>
+
+ <ul>
+ <li>
+ <codeph>--ldap_tls</codeph> tells Impala to start a TLS connection to the LDAP server, and to fail
+ authentication if it cannot be done.
+ </li>
+
+ <li rev="1.4.0">
+ <codeph>--ldap_ca_certificate="<varname>/path/to/certificate/pem</varname>"</codeph> specifies the
+ location of the certificate in standard <codeph>.PEM</codeph> format. Store this certificate on the local
+ filesystem, in a location that only the <codeph>impala</codeph> user and other trusted users can read.
+ </li>
+ </ul>
+ </conbody>
+ </concept>
+
+ <concept id="ldap_impala_shell">
+
+ <title>LDAP Authentication for impala-shell Interpreter</title>
+
+ <conbody>
+
+ <p>
+ To connect to Impala using LDAP authentication, you specify command-line options to the
+ <cmdname>impala-shell</cmdname> command interpreter and enter the password when prompted:
+ </p>
+
+ <ul>
+ <li>
+ <codeph>-l</codeph> enables LDAP authentication.
+ </li>
+
+ <li>
+ <codeph>-u</codeph> sets the user. As in Active Directory, the user is the short username, not the full
+ LDAP distinguished name. If your LDAP settings include a search base, use the
+ <codeph>--ldap_bind_pattern</codeph> option on the <cmdname>impalad</cmdname> daemon to translate the
+ short user name from <cmdname>impala-shell</cmdname> automatically to the fully qualified name.
+<!--
+include that as part of the
+username, for example <codeph>[email protected]</codeph>.
+-->
+ </li>
+
+ <li>
+ <cmdname>impala-shell</cmdname> automatically prompts for the password.
+ </li>
+ </ul>
+
+ <p>
+ For the full list of available <cmdname>impala-shell</cmdname> options, see
+ <xref href="impala_shell_options.xml#shell_options"/>.
+ </p>
+
+ <p>
+ <b>LDAP authentication for JDBC applications:</b> See <xref href="impala_jdbc.xml#impala_jdbc"/> for the
+ format to use with the JDBC connection string for servers using LDAP authentication.
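 </p>
+
+ <p>
+ For example, the following hypothetical <cmdname>impala-shell</cmdname> session (the hostname and
+ username are placeholders) shows how the shell prompts for the LDAP password rather than accepting
+ it on the command line:
+ </p>
+
+<codeblock>$ impala-shell -l -u henry -i impala_host.example.com
+Starting Impala Shell using LDAP-based authentication
+LDAP password for henry:</codeblock>
+
+ <p>
+ ODBC applications supply the equivalent user name and password through their own driver-specific
+ connection properties.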
+ </p> + </conbody> + </concept> + <concept id="ldap_impala_hue"> + <title>Enabling LDAP for Impala in Hue</title> + <prolog> + <metadata> + <data name="Category" value="Hue"/> + </metadata> + </prolog> + <conbody> + <section id="ldap_impala_hue_cm"> + <title>Enabling LDAP for Impala in Hue Using Cloudera Manager</title> + <p> + <ol> + <li>Go to the Hue service.</li> + <li>Click the Configuration tab.</li> + <li>Select <menucascade><uicontrol>Scope</uicontrol><uicontrol>Hue + Server</uicontrol></menucascade>.</li> + <li>Select + <menucascade><uicontrol>Category</uicontrol><uicontrol>Advanced</uicontrol></menucascade>.</li> + <li>Add the following properties to the <b>Hue Server Advanced + Configuration Snippet (Safety Valve) for + hue_safety_valve_server.ini</b> + property.<codeblock>[impala] +auth_username=<LDAP username of Hue user to be authenticated> +auth_password=<LDAP password of Hue user to be authenticated></codeblock></li> + <li>Click <b>Save Changes</b>.</li> + </ol> + </p> + </section> + <section id="ldap_impala_hue_cmdline"> + <title>Enabling LDAP for Impala in Hue Using the Command Line</title> + <p>LDAP authentication for the Impala app in Hue can be enabled by + setting the following properties under the <codeph>[impala]</codeph> + section in <codeph>hue.ini</codeph>. <table id="ldap_impala_hue_configs"> + <tgroup cols="2"> + <colspec colname="1" colwidth="1*" /> + <colspec colname="2" colwidth="2*" /> + <tbody> + <row> + <entry><codeph>auth_username</codeph></entry> + <entry>LDAP username of Hue user to be authenticated.</entry> + </row> + <row> + <entry><codeph>auth_password</codeph></entry> + <entry> + <p>LDAP password of Hue user to be authenticated.</p> + </entry> + </row> + </tbody> + </tgroup> + </table>These login details are only used by Impala to authenticate to + LDAP. The Impala service trusts Hue to have already validated the user + being impersonated, rather than simply passing on the credentials.</p> + </section> + </conbody> + </concept> + + <concept id="ldap_delegation"> + <title>Enabling Impala Delegation for LDAP Users</title> + <conbody> + <p> + See <xref href="impala_delegation.xml#delegation"/> for details about the delegation feature + that lets certain users submit queries using the credentials of other users. + </p> + </conbody> + </concept> + + <concept id="ldap_restrictions"> + + <title>LDAP Restrictions for Impala</title> + + <conbody> + + <p> + The LDAP support is preliminary. It currently has only been tested against Active Directory. + </p> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_limit.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_limit.xml b/docs/topics/impala_limit.xml new file mode 100644 index 0000000..ec12271 --- /dev/null +++ b/docs/topics/impala_limit.xml @@ -0,0 +1,151 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="limit"> + + <title>LIMIT Clause</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Reports"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The <codeph>LIMIT</codeph> clause in a <codeph>SELECT</codeph> query sets a maximum number of rows for the + result set. Pre-selecting the maximum size of the result set helps Impala to optimize memory usage while + processing a distributed query. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>LIMIT <varname>constant_integer_expression</varname></codeblock> + + <p> + The argument to the <codeph>LIMIT</codeph> clause must evaluate to a constant value. It can be a numeric + literal, or another kind of numeric expression involving operators, casts, and function return values. You + cannot refer to a column or use a subquery. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + This clause is useful in contexts such as: + </p> + + <ul> + <li> + To return exactly N items from a top-N query, such as the 10 highest-rated items in a shopping category or + the 50 hostnames that refer the most traffic to a web site. + </li> + + <li> + To demonstrate some sample values from a table or a particular query. (To display some arbitrary items, use + a query with no <codeph>ORDER BY</codeph> clause. An <codeph>ORDER BY</codeph> clause causes additional + memory and/or disk usage during the query.) + </li> + + <li> + To keep queries from returning huge result sets by accident if a table is larger than expected, or a + <codeph>WHERE</codeph> clause matches more rows than expected. + </li> + </ul> + + <p rev="1.2.1"> + Originally, the value for the <codeph>LIMIT</codeph> clause had to be a numeric literal. In Impala 1.2.1 and + higher, it can be a numeric expression. + </p> + + <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_limit"/> + + <p> + See <xref href="impala_order_by.xml#order_by"/> for details. + </p> + + <p conref="../shared/impala_common.xml#common/limit_and_offset"/> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/subquery_no_limit"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows how the <codeph>LIMIT</codeph> clause caps the size of the result set, with the + limit being applied after any other clauses such as <codeph>WHERE</codeph>. 
+ </p> + +<codeblock>[localhost:21000] > create database limits; +[localhost:21000] > use limits; +[localhost:21000] > create table numbers (x int); +[localhost:21000] > insert into numbers values (1), (3), (4), (5), (2); +Inserted 5 rows in 1.34s +[localhost:21000] > select x from numbers limit 100; ++---+ +| x | ++---+ +| 1 | +| 3 | +| 4 | +| 5 | +| 2 | ++---+ +Returned 5 row(s) in 0.26s +[localhost:21000] > select x from numbers limit 3; ++---+ +| x | ++---+ +| 1 | +| 3 | +| 4 | ++---+ +Returned 3 row(s) in 0.27s +[localhost:21000] > select x from numbers where x > 2 limit 2; ++---+ +| x | ++---+ +| 3 | +| 4 | ++---+ +Returned 2 row(s) in 0.27s</codeblock> + + <p> + For top-N and bottom-N queries, you use the <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses + together: + </p> + +<codeblock rev="obwl">[localhost:21000] > select x as "Top 3" from numbers order by x desc limit 3; ++-------+ +| top 3 | ++-------+ +| 5 | +| 4 | +| 3 | ++-------+ +[localhost:21000] > select x as "Bottom 3" from numbers order by x limit 3; ++----------+ +| bottom 3 | ++----------+ +| 1 | +| 2 | +| 3 | ++----------+ +</codeblock> + + <p> + You can use constant values besides integer literals as the <codeph>LIMIT</codeph> argument: + </p> + +<codeblock>-- Other expressions that yield constant integer values work too. +SELECT x FROM t1 LIMIT 1e6; -- Limit is one million. +SELECT x FROM t1 LIMIT length('hello world'); -- Limit is 11. +SELECT x FROM t1 LIMIT 2+2; -- Limit is 4. +SELECT x FROM t1 LIMIT cast(truncate(9.9) AS INT); -- Limit is 9. +</codeblock> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_lineage.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_lineage.xml b/docs/topics/impala_lineage.xml new file mode 100644 index 0000000..677fc3d --- /dev/null +++ b/docs/topics/impala_lineage.xml @@ -0,0 +1,113 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="lineage" rev="2.2.0"> + + <title>Viewing Lineage Information for Impala Data</title> + <titlealts audience="PDF"><navtitle>Viewing Lineage Info</navtitle></titlealts> + <prolog> + + <metadata> + + <data name="Category" value="Impala"/> + <data name="Category" value="Lineage"/> + <data name="Category" value="Governance"/> + <data name="Category" value="Data Management"/> + <data name="Category" value="Navigator"/> + <data name="Category" value="Administrators"/> + + </metadata> + + </prolog> + + <conbody> + + <p rev="2.2.0"> + <indexterm audience="Cloudera">lineage</indexterm> + <indexterm audience="Cloudera">column lineage</indexterm> + <term>Lineage</term> is a feature in the Cloudera Navigator data + management component that helps you track where data originated, and how + data propagates through the system through SQL statements such as + <codeph>SELECT</codeph>, <codeph>INSERT</codeph>, and <codeph>CREATE + TABLE AS SELECT</codeph>. Impala is covered by the Cloudera Navigator + lineage features in <keyword keyref="impala22_full"/> and higher. </p> + + <p> + This type of tracking is important in high-security configurations, especially in highly regulated industries + such as healthcare, pharmaceuticals, financial services and intelligence. 
For sensitive data of this kind, it is important to know all
+ the places in the system that contain that data or other data derived from it; to verify who has accessed
+ that data; and to be able to double-check that the data used to make a decision was processed correctly and
+ not tampered with.
+ </p>
+
+ <p>
+ You interact with this feature through <term>lineage diagrams</term> showing relationships between tables and
+ columns. For instructions about interpreting lineage diagrams, see
+ <xref audience="integrated" href="cn_iu_lineage.xml" /><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/cn_iu_lineage.html" scope="external" format="html"/>.
+ </p>
+
+ <section id="column_lineage">
+
+ <title>Column Lineage</title>
+
+ <p>
+ <term>Column lineage</term> tracks information in fine detail, at the level of
+ particular columns rather than entire tables.
+ </p>
+
+ <p>
+ For example, if you have a table with information derived from web logs, you might copy that data into
+ other tables as part of the ETL process. The ETL operations might involve transformations through
+ expressions and function calls, and rearranging the columns into more or fewer tables
+ (<term>normalizing</term> or <term>denormalizing</term> the data). Then for reporting, you might issue
+ queries against multiple tables and views. In this example, column lineage helps you determine that data
+ that entered the system as <codeph>RAW_LOGS.FIELD1</codeph> was then turned into
+ <codeph>WEBSITE_REPORTS.IP_ADDRESS</codeph> through an <codeph>INSERT ... SELECT</codeph> statement. Or,
+ conversely, you could start with a reporting query against a view, and trace the origin of the data in a
+ field such as <codeph>TOP_10_VISITORS.USER_ID</codeph> back to the underlying table and even further back
+ to the point where the data was first loaded into Impala.
+ </p>
+
+ <p>
+ When you have tables where you need to track or control access to sensitive information at the column
+ level, see <xref href="impala_authorization.xml#authorization"/> for how to implement column-level
+ security. You set up authorization using the Sentry framework, create views that refer to specific sets of
+ columns, and then assign authorization privileges to those views rather than the underlying tables.
+ </p>
+
+ </section>
+
+ <section id="lineage_data">
+
+ <title>Lineage Data for Impala</title>
+
+ <p>
+ The lineage feature is enabled by default. When lineage logging is enabled, the serialized column lineage
+ graph is computed for each query and stored in a specialized log file in JSON format.
+ </p>
+
+ <p>
+ Impala records queries in the lineage log if they complete successfully, or fail due to authorization
+ errors. For write operations such as <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>,
+ the statement is recorded in the lineage log only if it successfully completes. Therefore, the lineage
+ feature tracks data that was accessed by successful queries, or that unsuccessful queries attempted to
+ access before being blocked by an authorization failure. These kinds of queries represent data
+ that really was accessed, or where the attempted access could represent malicious activity.
+ </p>
+
+ <p>
+ Impala does not record in the lineage log queries that fail due to syntax errors, or queries that fail or
+ are cancelled before they reach the stage of requesting rows from the result set.
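 </p>
+
+ <p>
+ As a minimal sketch of the server-side setup (the directory path here is an assumption, not a
+ default), the log location is controlled by the <codeph>-lineage_event_log_dir</codeph> startup
+ option discussed below:
+ </p>
+
+<codeblock># The directory path is a hypothetical example, not a built-in default.
+impalad -lineage_event_log_dir=/var/log/impala/lineage <varname>other_startup_options</varname></codeblock>
+
+ <p>
+ Impala then writes the lineage record for each eligible query to the log file in that directory,
+ in JSON format.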
+ </p> + + <p> + To enable or disable this feature on a system not managed by Cloudera Manager, set or remove the + <codeph>-lineage_event_log_dir</codeph> configuration option for the <cmdname>impalad</cmdname> daemon. For + information about turning the lineage feature on and off through Cloudera Manager, see + <xref audience="integrated" href="datamgmt_impala_lineage_log.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/datamgmt_impala_lineage_log.html" scope="external" format="html"/>. + </p> + + </section> + + </conbody> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_literals.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_literals.xml b/docs/topics/impala_literals.xml new file mode 100644 index 0000000..d84d84c --- /dev/null +++ b/docs/topics/impala_literals.xml @@ -0,0 +1,384 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="literals"> + + <title>Literals</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">literals</indexterm> + Each of the Impala data types has corresponding notation for literal values of that type. You specify literal + values in SQL statements, such as in the <codeph>SELECT</codeph> list or <codeph>WHERE</codeph> clause of a + query, or as an argument to a function call. See <xref href="impala_datatypes.xml#datatypes"/> for a complete + list of types, ranges, and conversion rules. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="numeric_literals"> + + <title>Numeric Literals</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">numeric literals</indexterm> + To write literals for the integer types (<codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, + <codeph>INT</codeph>, and <codeph>BIGINT</codeph>), use a sequence of digits with optional leading zeros. + </p> + + <p rev="1.4.0"> + To write literals for the floating-point types (<codeph rev="1.4.0">DECIMAL</codeph>, + <codeph>FLOAT</codeph>, and <codeph>DOUBLE</codeph>), use a sequence of digits with an optional decimal + point (<codeph>.</codeph> character). To preserve accuracy during arithmetic expressions, Impala interprets + floating-point literals as the <codeph>DECIMAL</codeph> type with the smallest appropriate precision and + scale, until required by the context to convert the result to <codeph>FLOAT</codeph> or + <codeph>DOUBLE</codeph>. + </p> + + <p> + Integer values are promoted to floating-point when necessary, based on the context. + </p> + + <p> + You can also use exponential notation by including an <codeph>e</codeph> character. For example, + <codeph>1e6</codeph> is 1 times 10 to the power of 6 (1 million). A number in exponential notation is + always interpreted as floating-point. + </p> + + <p rev="tk"> + When Impala encounters a numeric literal, it considers the type to be the <q>smallest</q> that can + accurately represent the value. The type is promoted to larger or more accurate types if necessary, based + on subsequent parts of an expression. 
+ </p> + <p> + For example, you can see by the types Impala defines for the following table columns + how it interprets the corresponding numeric literals: + </p> +<codeblock>[localhost:21000] > create table ten as select 10 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc ten; ++------+---------+---------+ +| name | type | comment | ++------+---------+---------+ +| x | tinyint | | ++------+---------+---------+ + +[localhost:21000] > create table four_k as select 4096 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc four_k; ++------+----------+---------+ +| name | type | comment | ++------+----------+---------+ +| x | smallint | | ++------+----------+---------+ + +[localhost:21000] > create table one_point_five as select 1.5 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc one_point_five; ++------+--------------+---------+ +| name | type | comment | ++------+--------------+---------+ +| x | decimal(2,1) | | ++------+--------------+---------+ + +[localhost:21000] > create table one_point_three_three_three as select 1.333 as x; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ +[localhost:21000] > desc one_point_three_three_three; ++------+--------------+---------+ +| name | type | comment | ++------+--------------+---------+ +| x | decimal(4,3) | | ++------+--------------+---------+ +</codeblock> + </conbody> + </concept> + + <concept id="string_literals"> + + <title>String Literals</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">string literals</indexterm> + String literals are quoted using either single or double quotation marks. You can use either kind of quotes + for string literals, even both kinds for different literals within the same statement. + </p> + + <p rev="2.0.0"> + Quoted literals are considered to be of type <codeph>STRING</codeph>. To use quoted literals in contexts + requiring a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> value, <codeph>CAST()</codeph> the literal to + a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> of the appropriate length. + </p> + + <p> + <b>Escaping special characters:</b> + </p> + + <p> + To encode special characters within a string literal, precede them with the backslash (<codeph>\</codeph>) + escape character: + </p> + + <ul> + <li> + <codeph>\t</codeph> represents a tab. + </li> + + <li> + <codeph>\n</codeph> represents a newline or linefeed. This might cause extra line breaks in + <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\r</codeph> represents a carriage return. This might cause unusual formatting (making it appear + that some content is overwritten) in <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\b</codeph> represents a backspace. This might cause unusual formatting (making it appear that + some content is overwritten) in <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\0</codeph> represents an ASCII <codeph>nul</codeph> character (not the same as a SQL + <codeph>NULL</codeph>). This might not be visible in <cmdname>impala-shell</cmdname> output. + </li> + + <li> + <codeph>\Z</codeph> represents a DOS end-of-file character. This might not be visible in + <cmdname>impala-shell</cmdname> output. 
+ </li>
+
+ <li>
+ <codeph>\%</codeph> and <codeph>\_</codeph> can be used to escape wildcard characters within the string
+ passed to the <codeph>LIKE</codeph> operator.
+ </li>
+
+ <li>
+ <codeph>\</codeph> followed by 3 octal digits represents the ASCII code of a single character; for
+ example, <codeph>\101</codeph> is ASCII 65, the character <codeph>A</codeph>.
+ </li>
+
+ <li>
+ Use two consecutive backslashes (<codeph>\\</codeph>) to prevent the backslash from being interpreted as
+ an escape character.
+ </li>
+
+ <li>
+ Use the backslash to escape single or double quotation mark characters within a string literal, if the
+ literal is enclosed by the same type of quotation mark.
+ </li>
+
+ <li>
+ If the character following the <codeph>\</codeph> does not represent the start of a recognized escape
+ sequence, the character is passed through unchanged.
+ </li>
+ </ul>
+
+ <p>
+ <b>Quotes within quotes:</b>
+ </p>
+
+ <p>
+ To include a single quotation character within a string value, enclose the literal with either single or
+ double quotation marks, and optionally escape the single quote as a <codeph>\'</codeph> sequence. Earlier
+ releases required escaping a single quote inside double quotes. Continue using escape sequences in this
+ case if you also need to run your SQL code on older versions of Impala.
+ </p>
+
+ <p>
+ To include a double quotation character within a string value, enclose the literal with single quotation
+ marks; no escaping is necessary in this case. Or, enclose the literal with double quotation marks and
+ escape the double quote as a <codeph>\"</codeph> sequence.
+ </p>
+
+<codeblock>[localhost:21000] > select "What\'s happening?" as single_within_double,
+                  > 'I\'m not sure.' as single_within_single,
+                  > "Homer wrote \"The Iliad\"." as double_within_double,
+                  > 'Homer also wrote "The Odyssey".' as double_within_single;
++----------------------+----------------------+--------------------------+---------------------------------+
+| single_within_double | single_within_single | double_within_double     | double_within_single            |
++----------------------+----------------------+--------------------------+---------------------------------+
+| What's happening?    | I'm not sure.        | Homer wrote "The Iliad". | Homer also wrote "The Odyssey". |
++----------------------+----------------------+--------------------------+---------------------------------+
+</codeblock>
+
+ <p>
+ <b>Field terminator character in CREATE TABLE:</b>
+ </p>
+
+ <note conref="../shared/impala_common.xml#common/thorn"/>
+
+ <p>
+ <b>impala-shell considerations:</b>
+ </p>
+
+ <p>
+ When dealing with output that includes non-ASCII or non-printable characters such as linefeeds and
+ backspaces, use the <cmdname>impala-shell</cmdname> options to save to a file, turn off pretty printing, or
+ both, rather than relying on how the output appears visually. See
+ <xref href="impala_shell_options.xml#shell_options"/> for a list of <cmdname>impala-shell</cmdname>
+ options.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="boolean_literals">
+
+ <title>Boolean Literals</title>
+
+ <conbody>
+
+ <p>
+ For <codeph>BOOLEAN</codeph> values, the literals are <codeph>TRUE</codeph> and <codeph>FALSE</codeph>,
+ written without quotation marks; the keywords are case-insensitive.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>select true;
+select * from t1 where assertion = false;
+select case bool_col when true then 'yes' when false then 'no' else 'null' end from t1;</codeblock>
+ </conbody>
+ </concept>
+
+ <concept id="timestamp_literals">
+
+ <title>Timestamp Literals</title>
+
+ <conbody>
+
+ <p conref="../shared/impala_common.xml#common/timestamp_conversions"/>
+
+ <p>
+ You can also use <codeph>INTERVAL</codeph> expressions to add or subtract from timestamp literal values,
+ such as <codeph>'1966-07-30' + INTERVAL 5 YEARS + INTERVAL 3 DAYS</codeph>. See
+ <xref href="impala_timestamp.xml#timestamp"/> for details.
+ </p>
+
+ <p>
+ Depending on your data pipeline, you might receive date and time data as text, in notation that does not
+ exactly match the format for Impala <codeph>TIMESTAMP</codeph> literals.
+ See <xref href="impala_datetime_functions.xml#datetime_functions"/> for functions that can convert
+ between a variety of string literals (including different field order, separators, and timezone notation)
+ and equivalent <codeph>TIMESTAMP</codeph> or numeric values.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="null">
+
+ <title>NULL</title>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">NULL</indexterm>
+ The notion of <codeph>NULL</codeph> values is familiar from all kinds of database systems, but each SQL
+ dialect can have its own behavior and restrictions on <codeph>NULL</codeph> values. For Big Data
+ processing, the precise semantics of <codeph>NULL</codeph> values are significant: any misunderstanding
+ could lead to inaccurate results or misformatted data that could be time-consuming to correct for large
+ data sets.
+ </p>
+
+ <ul>
+ <li>
+ <codeph>NULL</codeph> is a different value from an empty string. The empty string is represented by a
+ string literal with nothing inside, <codeph>""</codeph> or <codeph>''</codeph>.
+ </li>
+
+ <li>
+ In a delimited text file, the <codeph>NULL</codeph> value is represented by the special token
+ <codeph>\N</codeph>.
+ </li>
+
+ <li>
+ When Impala inserts data into a partitioned table, and the value of one of the partitioning columns is
+ <codeph>NULL</codeph> or the empty string, the data is placed in a special partition that holds only
+ these two kinds of values. When these values are returned in a query, the result is <codeph>NULL</codeph>
+ whether the value was originally <codeph>NULL</codeph> or an empty string. This behavior is compatible
+ with the way Hive treats <codeph>NULL</codeph> values in partitioned tables. Hive does not allow empty
+ strings as partition keys, and it returns a string value such as
+ <codeph>__HIVE_DEFAULT_PARTITION__</codeph> instead of <codeph>NULL</codeph> when such values are
+ returned from a query. For example:
+<codeblock>create table t1 (i int) partitioned by (x int, y string);
+-- Select an INT column from another table, with all rows going into a special HDFS subdirectory
+-- named __HIVE_DEFAULT_PARTITION__. Depending on whether one or both of the partitioning keys
+-- are null, this special directory name occurs at different levels of the physical data directory
+-- for the table.
+insert into t1 partition(x=NULL, y=NULL) select c1 from some_other_table; +insert into t1 partition(x, y=NULL) select c1, c2 from some_other_table; +insert into t1 partition(x=NULL, y) select c1, c3 from some_other_table;</codeblock> + </li> + + <li> + There is no <codeph>NOT NULL</codeph> clause when defining a column to prevent <codeph>NULL</codeph> + values in that column. + </li> + + <li> + There is no <codeph>DEFAULT</codeph> clause to specify a non-<codeph>NULL</codeph> default value. + </li> + + <li> + If an <codeph>INSERT</codeph> operation mentions some columns but not others, the unmentioned columns + contain <codeph>NULL</codeph> for all inserted rows. + </li> + + <li rev="1.2.1"> + <p conref="../shared/impala_common.xml#common/null_sorting_change"/> + <note> + <!-- To do: Probably a bunch of similar view-related restrictions like this that should be collected, reused, or cross-referenced under the Views topic. --> + Because the <codeph>NULLS FIRST</codeph> and <codeph>NULLS LAST</codeph> keywords are not currently + available in Hive queries, any views you create using those keywords will not be available through + Hive. + </note> + </li> + + <li> + In all other contexts besides sorting with <codeph>ORDER BY</codeph>, comparing a <codeph>NULL</codeph> + to anything else returns <codeph>NULL</codeph>, making the comparison meaningless. For example, + <codeph>10 > NULL</codeph> produces <codeph>NULL</codeph>, <codeph>10 < NULL</codeph> also produces + <codeph>NULL</codeph>, <codeph>5 BETWEEN 1 AND NULL</codeph> produces <codeph>NULL</codeph>, and so on. + </li> + </ul> + + <p> + Several built-in functions serve as shorthand for evaluating expressions and returning + <codeph>NULL</codeph>, 0, or some other substitution value depending on the expression result: + <codeph>ifnull()</codeph>, <codeph>isnull()</codeph>, <codeph>nvl()</codeph>, <codeph>nullif()</codeph>, + <codeph>nullifzero()</codeph>, and <codeph>zeroifnull()</codeph>. See + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_live_progress.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_live_progress.xml b/docs/topics/impala_live_progress.xml new file mode 100644 index 0000000..f99e4b1 --- /dev/null +++ b/docs/topics/impala_live_progress.xml @@ -0,0 +1,86 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.3.0" id="live_progress"> + + <title>LIVE_PROGRESS Query Option (<keyword keyref="impala23"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>LIVE_PROGRESS</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Reports"/> + <data name="Category" value="impala-shell"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.3.0"> + <indexterm audience="Cloudera">LIVE_PROGRESS query option</indexterm> + For queries submitted through the <cmdname>impala-shell</cmdname> command, + displays an interactive progress bar showing roughly what percentage of + processing has been completed. 
When the query finishes, the progress bar is erased + from the <cmdname>impala-shell</cmdname> console output. + </p> + + <p> + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/command_line_blurb"/> + <p> + You can enable this query option within <cmdname>impala-shell</cmdname> + by starting the shell with the <codeph>--live_progress</codeph> + command-line option. + You can still turn this setting off and on again within the shell through the + <codeph>SET</codeph> command. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + <p conref="../shared/impala_common.xml#common/live_reporting_details"/> + <p> + For a more detailed way of tracking the progress of an interactive query through + all phases of processing, see <xref href="impala_live_summary.xml#live_summary"/>. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + <p> + Because the percentage complete figure is calculated using the number of + issued and completed <q>scan ranges</q>, which occur while reading the table + data, the progress bar might reach 100% before the query is entirely finished. + For example, the query might do work to perform aggregations after all the + table data has been read. If many of your queries fall into this category, + consider using the <codeph>LIVE_SUMMARY</codeph> option instead for + more granular progress reporting. + </p> + <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_compute_stats_caveat"/> + <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_shell_only_caveat"/> + + <p conref="../shared/impala_common.xml#common/added_in_230"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> +<codeblock><![CDATA[[localhost:21000] > set live_progress=true; +LIVE_PROGRESS set to true +[localhost:21000] > select count(*) from customer; ++----------+ +| count(*) | ++----------+ +| 150000 | ++----------+ +[localhost:21000] > select count(*) from customer t1 cross join customer t2; +[################################### ] 50% +[######################################################################] 100% + +]]> +</codeblock> + + <p conref="../shared/impala_common.xml#common/live_progress_live_summary_asciinema"/> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_live_summary.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_live_summary.xml b/docs/topics/impala_live_summary.xml new file mode 100644 index 0000000..9c7329b --- /dev/null +++ b/docs/topics/impala_live_summary.xml @@ -0,0 +1,211 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.3.0" id="live_summary"> + + <title>LIVE_SUMMARY Query Option (<keyword keyref="impala23"/> or higher only)</title> + <titlealts audience="PDF"><navtitle>LIVE_SUMMARY</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Reports"/> + <data name="Category" value="impala-shell"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + 
<p rev="2.3.0"> + <indexterm audience="Cloudera">LIVE_SUMMARY query option</indexterm> + For queries submitted through the <cmdname>impala-shell</cmdname> command, + displays the same output as the <codeph>SUMMARY</codeph> command, + with the measurements updated in real time as the query progresses. + When the query finishes, the final <codeph>SUMMARY</codeph> output remains + visible in the <cmdname>impala-shell</cmdname> console output. + </p> + + <p> + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <p conref="../shared/impala_common.xml#common/command_line_blurb"/> + <p> + You can enable this query option within <cmdname>impala-shell</cmdname> + by starting the shell with the <codeph>--live_summary</codeph> + command-line option. + You can still turn this setting off and on again within the shell through the + <codeph>SET</codeph> command. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + <p> + The live summary output can be useful for evaluating long-running queries, + to evaluate which phase of execution takes up the most time, or if some hosts + take much longer than others for certain operations, dragging overall performance down. + By making the information available in real time, this feature lets you decide what + action to take even before you cancel a query that is taking much longer than normal. + </p> + <p> + For example, you might see the HDFS scan phase taking a long time, and therefore revisit + performance-related aspects of your schema design such as constructing a partitioned table, + switching to the Parquet file format, running the <codeph>COMPUTE STATS</codeph> statement + for the table, and so on. + Or you might see a wide variation between the average and maximum times for all hosts to + perform some phase of the query, and therefore investigate if one particular host + needed more memory or was experiencing a network problem. + </p> + <p conref="../shared/impala_common.xml#common/live_reporting_details"/> + <p> + For a simple and concise way of tracking the progress of an interactive query, see + <xref href="impala_live_progress.xml#live_progress"/>. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_compute_stats_caveat"/> + <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_shell_only_caveat"/> + + <p conref="../shared/impala_common.xml#common/added_in_230"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows a series of <codeph>LIVE_SUMMARY</codeph> reports that + are displayed during the course of a query, showing how the numbers increase to + show the progress of different phases of the distributed query. When you do the same + in <cmdname>impala-shell</cmdname>, only a single report is displayed at any one time, + with each update overwriting the previous numbers. + </p> + +<codeblock><![CDATA[[localhost:21000] > set live_summary=true; +LIVE_SUMMARY set to true +[localhost:21000] > select count(*) from customer t1 cross join customer t2; ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. 
Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | 10.00 MB | | +| 02:NESTED LOOP JOIN | 0 | 0ns | 0ns | 0 | 22.50B | 0 B | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 0 | 0ns | 0ns | 0 | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 0 | 0ns | 0ns | 0 | 150.00K | 0 B | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 17.62s | 17.62s | 81.14M | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 247.53ms | 247.53ms | 1.02K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 61.85s | 61.85s | 283.43M | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 247.59ms | 247.59ms | 2.05K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +]]> +</codeblock> + +<!-- Keeping this sample output that illustrates a couple of glitches in the LIVE_SUMMARY display, hidden, to help filing JIRAs. --> +<codeblock audience="Cloudera"><![CDATA[[ ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. 
Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 91.34s | 91.34s | 419.48M | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 247.63ms | 247.63ms | 3.07K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 140.49s | 140.49s | 646.82M | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 247.73ms | 247.73ms | 5.12K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 228.96s | 228.96s | 1.06B | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 247.83ms | 247.83ms | 7.17K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. 
Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 563.11s | 563.11s | 2.59B | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 248.11ms | 248.11ms | 17.41K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 985.71s | 985.71s | 4.54B | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 248.49ms | 248.49ms | 30.72K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | None | None | 5.42B | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 248.66ms | 248.66ms | 36.86K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + +[localhost:21000] > select count(*) from customer t1 cross join customer t2; +Query: select count(*) from customer t1 cross join customer t2 +[####################################################################################################] 100% ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. 
+| Operator            | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                |
+]]>
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/live_progress_live_summary_asciinema"/>
+
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_load_data.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_load_data.xml b/docs/topics/impala_load_data.xml
new file mode 100644
index 0000000..0652914
--- /dev/null
+++ b/docs/topics/impala_load_data.xml
@@ -0,0 +1,256 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.1" id="load_data">
+
+  <title>LOAD DATA Statement</title>
+  <titlealts audience="PDF"><navtitle>LOAD DATA</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="ETL"/>
+      <data name="Category" value="Ingest"/>
+      <data name="Category" value="DML"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="HDFS"/>
+      <data name="Category" value="Tables"/>
+      <data name="Category" value="S3"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">LOAD DATA statement</indexterm>
+      The <codeph>LOAD DATA</codeph> statement streamlines the ETL process for an internal Impala table by moving a
+      data file or all the data files in a directory from an HDFS location into the Impala data directory for that
+      table.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>LOAD DATA INPATH '<varname>hdfs_file_or_directory_path</varname>' [OVERWRITE] INTO TABLE <varname>tablename</varname>
+  [PARTITION (<varname>partcol1</varname>=<varname>val1</varname>, <varname>partcol2</varname>=<varname>val2</varname> ...)]</codeblock>
+
+    <p>
+      When the <codeph>LOAD DATA</codeph> statement operates on a partitioned table,
+      it always operates on one partition at a time. Specify the <codeph>PARTITION</codeph> clause
+      and list all the partition key columns, with a constant value specified for each.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/dml_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <ul>
+      <li>
+        The loaded data files are moved, not copied, into the Impala data directory.
+      </li>
+
+      <li>
+        You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the
+        files inside that directory. You cannot specify any sort of wildcard to take only some of the files from a
+        directory. When loading a directory full of data files, keep all the data files at the top level, with no
+        nested directories underneath.
+      </li>
+
+      <li>
+        Currently, the Impala <codeph>LOAD DATA</codeph> statement only imports files from HDFS, not from the local
+        filesystem. It does not support the <codeph>LOCAL</codeph> keyword of the Hive <codeph>LOAD DATA</codeph>
+        statement. You must specify a path, not an <codeph>hdfs://</codeph> URI.
+      </li>
+
+      <li>
+        In the interest of speed, only limited error checking is done. If the loaded files have the wrong file
+        format, different columns than the destination table, or some other kind of mismatch, Impala does not raise
+        any error for the <codeph>LOAD DATA</codeph> statement. Querying the table afterward could produce a runtime
+        error or unexpected results. Currently, the only checking the <codeph>LOAD DATA</codeph> statement does is
+        to avoid mixing together uncompressed and LZO-compressed text files in the same table.
+      </li>
+
+      <li>
+        When you specify an HDFS directory name as the <codeph>LOAD DATA</codeph> argument, any hidden files in
+        that directory (files whose names start with a <codeph>.</codeph>) are not moved to the Impala data
+        directory.
+      </li>
+
+      <li rev="2.5.0 IMPALA-2867">
+        The operation fails if the source directory contains any non-hidden subdirectories.
+        Prior to <keyword keyref="impala25_full"/>, if the source directory contained any subdirectory, even a
+        hidden one such as <filepath>_impala_insert_staging</filepath>, the <codeph>LOAD DATA</codeph> statement
+        would fail. In <keyword keyref="impala25_full"/> and higher, <codeph>LOAD DATA</codeph> ignores hidden
+        subdirectories in the source directory, and only fails if any of the subdirectories are non-hidden.
+      </li>
+
+      <li>
+        The loaded data files retain their original names in the new location, unless a name conflicts with an
+        existing data file, in which case the name of the new file is modified slightly to be unique. (The
+        name-mangling is a slight difference from the Hive <codeph>LOAD DATA</codeph> statement, which replaces
+        identically named files.)
+      </li>
+
+      <li>
+        By providing an easy way to transport files from known locations in HDFS into the Impala data directory
+        structure, the <codeph>LOAD DATA</codeph> statement lets you avoid memorizing the locations and layout of
+        the HDFS directory tree containing the Impala databases and tables. (For a quick way to check the location
+        of the data files for an Impala table, issue the statement <codeph>DESCRIBE FORMATTED
+        <varname>table_name</varname></codeph>.)
+      </li>
+
+      <li>
+        The <codeph>PARTITION</codeph> clause is especially convenient for ingesting new data for a partitioned
+        table. As you receive new data for a time period, geographic region, or other division that corresponds to
+        one or more partitioning columns, you can load that data straight into the appropriate Impala data
+        directory, which might be nested several levels down if the table is partitioned by multiple columns. When
+        the table is partitioned, you must specify constant values for all the partitioning columns, as shown in
+        the sketch following this list.
+      </li>
+    </ul>
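+
+    <p>
+      As a quick sketch of how the <codeph>PARTITION</codeph> clause fits together with the rest of the statement
+      (the <codeph>sales</codeph> table, its partitioning columns, and the HDFS path here are hypothetical, for
+      illustration only):
+    </p>
+
+<codeblock>-- Hypothetical table partitioned by two columns.
+CREATE TABLE sales (id BIGINT, amount DOUBLE) PARTITIONED BY (year INT, month INT);
+
+-- Each LOAD DATA statement targets a single partition, with a
+-- constant value specified for every partitioning column.
+LOAD DATA INPATH '/user/cloudera/sales_2013_06'
+  INTO TABLE sales PARTITION (year=2013, month=6);</codeblock>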
+
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+    <p rev="2.3.0">
+      Because Impala currently cannot create Parquet data files containing complex types
+      (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>), the
+      <codeph>LOAD DATA</codeph> statement is especially important when working with
+      tables containing complex type columns. You create the Parquet data files outside
+      Impala, then use either <codeph>LOAD DATA</codeph>, an external table, or HDFS-level
+      file operations followed by <codeph>REFRESH</codeph> to associate the data files with
+      the corresponding table.
+      See <xref href="impala_complex_types.xml#complex_types"/> for details about using complex types.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
+
+    <note conref="../shared/impala_common.xml#common/compute_stats_next"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      First, we use a trivial Python script to write different numbers of strings (one per line) into files stored
+      in the <codeph>cloudera</codeph> HDFS user account. (Substitute the path for your own HDFS user account when
+      doing <cmdname>hdfs dfs</cmdname> operations like these.)
+    </p>
+
+<codeblock>$ random_strings.py 1000 | hdfs dfs -put - /user/cloudera/thousand_strings.txt
+$ random_strings.py 100 | hdfs dfs -put - /user/cloudera/hundred_strings.txt
+$ random_strings.py 10 | hdfs dfs -put - /user/cloudera/ten_strings.txt</codeblock>
+
+    <p>
+      Next, we create a table and load an initial set of data into it. Remember, unless you specify a
+      <codeph>STORED AS</codeph> clause, Impala tables default to <codeph>TEXTFILE</codeph> format with Ctrl-A (hex
+      01) as the field delimiter. This example uses a single-column table, so the delimiter is not significant. For
+      large-scale ETL jobs, you would typically use binary format data files such as Parquet or Avro, and load them
+      into Impala tables that use the corresponding file format.
+    </p>
+
+<codeblock>[localhost:21000] > create table t1 (s string);
+[localhost:21000] > load data inpath '/user/cloudera/thousand_strings.txt' into table t1;
+Query finished, fetching results ...
++----------------------------------------------------------+
+| summary                                                  |
++----------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 1 |
++----------------------------------------------------------+
+Returned 1 row(s) in 0.61s
+[localhost:21000] > select count(*) from t1;
+Query finished, fetching results ...
++------+
+| _c0  |
++------+
+| 1000 |
++------+
+Returned 1 row(s) in 0.67s
+[localhost:21000] > load data inpath '/user/cloudera/thousand_strings.txt' into table t1;
+ERROR: AnalysisException: INPATH location '/user/cloudera/thousand_strings.txt' does not exist.</codeblock>
+
+    <p>
+      As indicated by the error message at the end of the previous example, the data file was moved from its
+      original location. The following example illustrates how the data file was moved into the Impala data
+      directory for the destination table, keeping its original filename:
+    </p>
+
+<codeblock>$ hdfs dfs -ls /user/hive/warehouse/load_data_testing.db/t1
+Found 1 items
+-rw-r--r--   1 cloudera cloudera      13926 2013-06-26 15:40 /user/hive/warehouse/load_data_testing.db/t1/thousand_strings.txt</codeblock>
+
+    <p>
+      The following example demonstrates the difference between the <codeph>INTO TABLE</codeph> and
+      <codeph>OVERWRITE INTO TABLE</codeph> clauses. The table already contains 1000 rows. After issuing the
+      <codeph>LOAD DATA</codeph> statement with the <codeph>INTO TABLE</codeph> clause, the table contains 100 more
+      rows, for a total of 1100. After issuing the <codeph>LOAD DATA</codeph> statement with the <codeph>OVERWRITE
+      INTO TABLE</codeph> clause, the former contents are gone, and now the table only contains the 10 rows from
+      the just-loaded data file.
+    </p>
+
+<codeblock>[localhost:21000] > load data inpath '/user/cloudera/hundred_strings.txt' into table t1;
+Query finished, fetching results ...
++----------------------------------------------------------+
+| summary                                                  |
++----------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 2 |
++----------------------------------------------------------+
+Returned 1 row(s) in 0.24s
+[localhost:21000] > select count(*) from t1;
+Query finished, fetching results ...
++------+
+| _c0  |
++------+
+| 1100 |
++------+
+Returned 1 row(s) in 0.55s
+[localhost:21000] > load data inpath '/user/cloudera/ten_strings.txt' overwrite into table t1;
+Query finished, fetching results ...
++----------------------------------------------------------+
+| summary                                                  |
++----------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 1 |
++----------------------------------------------------------+
+Returned 1 row(s) in 0.26s
+[localhost:21000] > select count(*) from t1;
+Query finished, fetching results ...
++-----+
+| _c0 |
++-----+
+| 10  |
++-----+
+Returned 1 row(s) in 0.62s</codeblock>
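+
+    <p>
+      For comparison, the following sketch shows the HDFS-level alternative mentioned earlier: copying a file
+      directly into the table data directory, then issuing a <codeph>REFRESH</codeph> statement so that Impala
+      recognizes the new file. (The paths here are hypothetical; check the actual data directory with
+      <codeph>DESCRIBE FORMATTED</codeph> first. Also note that <cmdname>hdfs dfs -put</cmdname> copies a local
+      file into HDFS, whereas <codeph>LOAD DATA</codeph> moves files already in HDFS.)
+    </p>
+
+<codeblock>$ hdfs dfs -put extra_strings.txt /user/hive/warehouse/load_data_testing.db/t1
+
+[localhost:21000] > refresh t1;</codeblock>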
+
+    <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+    <p conref="../shared/impala_common.xml#common/s3_dml"/>
+    <p conref="../shared/impala_common.xml#common/s3_dml_performance"/>
+    <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p>
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+    <p rev="CDH-19187">
+      The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+      typically the <codeph>impala</codeph> user, must have read and write
+      permissions for the files in the source directory, and write
+      permission for the destination directory.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+    <p conref="../shared/impala_common.xml#common/kudu_no_load_data"/>
+
+    <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+    <p conref="../shared/impala_common.xml#common/hbase_no_load_data"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+    <p>
+      The <codeph>LOAD DATA</codeph> statement is an alternative to the
+      <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> statement.
+      Use <codeph>LOAD DATA</codeph> when you have the data files in HDFS
+      but outside of any Impala table.
+    </p>
+    <p>
+      The <codeph>LOAD DATA</codeph> statement is also an alternative
+      to the <codeph>CREATE EXTERNAL TABLE</codeph> statement. Use
+      <codeph>LOAD DATA</codeph> when it is appropriate to move the
+      data files under Impala control rather than querying them
+      from their original location. See <xref href="impala_tables.xml#external_tables"/>
+      for information about working with external tables.
+    </p>
+  </conbody>
+</concept>
