http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/1fcc8cee/docs/topics/impala_config_performance.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_config_performance.xml b/docs/topics/impala_config_performance.xml new file mode 100644 index 0000000..837e63c --- /dev/null +++ b/docs/topics/impala_config_performance.xml @@ -0,0 +1,291 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="config_performance"> + + <title>Post-Installation Configuration for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> + + <conbody> + + <p id="p_24"> + This section describes the mandatory and recommended configuration settings for Impala. If Impala is + installed using Cloudera Manager, some of these configurations are completed automatically; you must still + configure short-circuit reads manually. If you installed Impala without Cloudera Manager, or if you want to + customize your environment, consider making the changes described in this topic. + </p> + + <p> +<!-- Could conref this paragraph from ciiu_install.xml. --> + In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular + component configuration details in one of the free-form fields on the Impala configuration pages within + Cloudera Manager. <ph conref="../shared/impala_common.xml#common/safety_valve"/> + </p> + + <ul> + <li> + You must enable short-circuit reads, whether or not Impala was installed through Cloudera Manager. This + setting goes in the Impala configuration settings, not the Hadoop-wide settings. + </li> + + <li> + If you installed Impala in an environment that is not managed by Cloudera Manager, you must enable block + location tracking, and you can optionally enable native checksumming for optimal performance. + </li> + + <li> + If you deployed Impala using Cloudera Manager see + <xref href="impala_perf_testing.xml#performance_testing"/> to confirm proper configuration. + </li> + </ul> + + <section id="section_fhq_wyv_ls"> + <title>Mandatory: Short-Circuit Reads</title> + <p> Enabling short-circuit reads allows Impala to read local data directly + from the file system. This removes the need to communicate through the + DataNodes, improving performance. This setting also minimizes the number + of additional copies of data. Short-circuit reads requires + <codeph>libhadoop.so</codeph> + <!-- This link went stale. Not obvious how to keep it in sync with whatever Hadoop CDH is using behind the scenes. So hide the link for now. --> + <!-- (the <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>) --> + (the Hadoop Native Library) to be accessible to both the server and the + client. <codeph>libhadoop.so</codeph> is not available if you have + installed from a tarball. You must install from an + <codeph>.rpm</codeph>, <codeph>.deb</codeph>, or parcel to use + short-circuit local reads. <note> If you use Cloudera Manager, you can + enable short-circuit reads through a checkbox in the user interface + and that setting takes effect for Impala as well. </note> + </p> + <p> Cloudera strongly recommends using Impala with CDH 4.2 or higher, + ideally the latest 4.x release. 
Impala does support short-circuit reads + with CDH 4.1, but for best performance, upgrade to CDH 4.3 or higher. + The process of configuring short-circuit reads varies according to which + version of CDH you are using. Choose the procedure that is appropriate + for your environment. </p> + <p> + <b>To configure DataNodes for short-circuit reads with CDH 4.2 or + higher:</b> + </p> + <ol id="ol_qlq_wyv_ls"> + <li id="copy_config_files"> Copy the client + <codeph>core-site.xml</codeph> and <codeph>hdfs-site.xml</codeph> + configuration files from the Hadoop configuration directory to the + Impala configuration directory. The default Impala configuration + location is <codeph>/etc/impala/conf</codeph>. </li> + <li> + <indexterm audience="Cloudera" + >dfs.client.read.shortcircuit</indexterm> + <indexterm audience="Cloudera">dfs.domain.socket.path</indexterm> + <indexterm audience="Cloudera" + >dfs.client.file-block-storage-locations.timeout.millis</indexterm> + On all Impala nodes, configure the following properties in <!-- Exact timing is unclear, since we say farther down to copy /etc/hadoop/conf/hdfs-site.xml to /etc/impala/conf. + Which wouldn't work if we already modified the Impala version of the file here. Not to mention that this + doesn't take the CM interface into account, where these /etc files might not exist in those locations. --> + <!-- <codeph>/etc/impala/conf/hdfs-site.xml</codeph> as shown: --> + Impala's copy of <codeph>hdfs-site.xml</codeph> as shown: <codeblock><property> + <name>dfs.client.read.shortcircuit</name> + <value>true</value> +</property> + +<property> + <name>dfs.domain.socket.path</name> + <value>/var/run/hdfs-sockets/dn</value> +</property> + +<property> + <name>dfs.client.file-block-storage-locations.timeout.millis</name> + <value>10000</value> +</property></codeblock> + <!-- Former socket.path value: <value>/var/run/hadoop-hdfs/dn._PORT</value> --> + <!-- + <note> + The text <codeph>_PORT</codeph> appears just as shown; you do not need to + substitute a number. + </note> +--> + </li> + <li> + <p> If <codeph>/var/run/hadoop-hdfs/</codeph> is group-writable, make + sure its group is <codeph>root</codeph>. </p> + <note> If you are also going to enable block location tracking, you + can skip copying configuration files and restarting DataNodes and go + straight to <xref href="#config_performance/block_location_tracking" + >Optional: Block Location Tracking</xref>. + Configuring short-circuit reads and block location tracking require + the same process of copying files and restarting services, so you + can complete that process once when you have completed all + configuration changes. Whether you copy files and restart services + now or during configuring block location tracking, short-circuit + reads are not enabled until you complete those final steps. </note> + </li> + <li id="restart_all_datanodes"> After applying these changes, restart + all DataNodes. </li> + </ol> + <p> + <b>To configure DataNodes for short-circuit reads with CDH 4.1:</b> + </p> + <!-- Repeated twice, turn into a conref. --> + <note> Cloudera strongly recommends using Impala with CDH 4.2 or higher, + ideally the latest 4.x release. Impala does support short-circuit reads + with CDH 4.1, but for best performance, upgrade to CDH 4.3 or higher. + The process of configuring short-circuit reads varies according to which + version of CDH you are using. Choose the procedure that is appropriate + for your environment. 
</note> + <ol id="ol_cqq_wyv_ls"> + <li> Enable short-circuit reads by adding settings to the Impala + <codeph>core-site.xml</codeph> file. <ul id="ul_a5q_wyv_ls"> + <li> If you installed Impala using Cloudera Manager, short-circuit + reads should be properly configured, but you can review the + configuration by checking the contents of + the <codeph>core-site.xml</codeph> file, which is installed at + <codeph>/etc/impala/conf</codeph> by default. </li> + <li> If you installed using packages, instead of using Cloudera + Manager, create the <codeph>core-site.xml</codeph> file. This can + be easily done by copying + the <codeph>core-site.xml</codeph> client configuration file from + another machine that is running Hadoop services. This file must be + copied to the Impala configuration directory. The Impala + configuration directory is set by + the <codeph>IMPALA_CONF_DIR</codeph> environment variable and is + by default <codeph>/etc/impala/conf</codeph>. To confirm the + Impala configuration directory, check + the <codeph>IMPALA_CONF_DIR</codeph> environment variable value. + <note> If the Impala configuration directory does not exist, + create it and then add the <codeph>core-site.xml</codeph> file. + </note> + </li> + </ul> Add the following to the <codeph>core-site.xml</codeph> file: <codeblock><property> + <name>dfs.client.read.shortcircuit</name> +   <value>true</value> +</property></codeblock> + <note> For an installation managed by Cloudera Manager, specify these + settings in the Impala dialogs, in the options field for HDFS. <ph + conref="../shared/impala_common.xml#common/safety_valve" /> + </note> + </li> + <li> For each DataNode, enable access by adding the following to + the <codeph>hdfs-site.xml</codeph> file: <codeblock rev="1.3.0"><property> + <name>dfs.client.use.legacy.blockreader.local</name> + <value>true</value> +</property> + +<property> + <name>dfs.datanode.data.dir.perm</name> + <value>750</value> +</property> + +<property> + <name>dfs.block.local-path-access.user</name> + <value>impala</value> +</property> + +<property> + <name>dfs.client.file-block-storage-locations.timeout.millis</name> + <value>10000</value> +</property></codeblock> + <note> In the preceding example, + the <codeph>dfs.block.local-path-access.user</codeph> is the user + running the <codeph>impalad</codeph> process. By default, that + account is <codeph>impala</codeph>. </note> + </li> + <li> Use <codeph>usermod</codeph>  to add users requiring local block + access to the appropriate HDFS group. For example, if you + assigned <codeph>impala</codeph> to the + <codeph>dfs.block.local-path-access.user</codeph>  property, you + would add <codeph>impala</codeph>  to the hadoop HDFS group: <codeblock>$ usermod -a -G hadoop impala</codeblock> + <note> The default HDFS group is <codeph>hadoop</codeph>, but it is + possible to have an environment configured to use an alternate + group. 
To find the configured HDFS group name using the Cloudera + Manager Admin Console: <ol id="ol_km4_4bc_nr"> + <li>Go to the HDFS service.</li> + <li + conref="../shared/cm_common_elements.xml#cm/config_edit" /> + <li>Click <menucascade> + <uicontrol>Scope</uicontrol> + <uicontrol><varname>HDFS service name</varname> + (Service-Wide)</uicontrol> + </menucascade>.</li> + <li>Click <menucascade> + <uicontrol>Category</uicontrol> + <uicontrol>Advanced</uicontrol> + </menucascade>.</li> + <li>The <uicontrol>Shared Hadoop Group Name</uicontrol> property + contains the group name.</li> + </ol></note> + <note> If you are going to enable block location tracking, you can + skip copying configuration files and restarting DataNodes and go + straight to <xref href="#config_performance/block_location_tracking"/>. + Configuring short-circuit reads and block + location tracking require the same process of copying files and + restarting services, so you can complete that process once when you + have completed all configuration changes. Whether you copy files and + restart services now or during configuring block location tracking, + short-circuit reads are not enabled until you complete those final + steps. </note> + </li> + <li conref="#config_performance/copy_config_files" /> + <li conref="#config_performance/restart_all_datanodes" /> + </ol> + </section> + + <section id="block_location_tracking"> + + <title>Mandatory: Block Location Tracking</title> + + <p> + Enabling block location metadata allows Impala to know which disks data blocks are located on, allowing + better utilization of the underlying disks. Impala will not start unless this setting is enabled. + </p> + + <p> + <b>To enable block location tracking:</b> + </p> + + <ol> + <li> + For each DataNode, add the following to the <codeph>hdfs-site.xml</codeph> file: +<codeblock><property> + <name>dfs.datanode.hdfs-blocks-metadata.enabled</name> + <value>true</value> +</property> </codeblock> + </li> + + <li conref="#config_performance/copy_config_files"/> + + <li conref="#config_performance/restart_all_datanodes"/> + </ol> + </section> + + <section id="native_checksumming"> + + <title>Optional: Native Checksumming</title> + + <p> + Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if + that library is available. + </p> + + <p id="p_29"> + <b>To enable native checksumming:</b> + </p> + + <p> + If you installed CDH from packages, the native checksumming library is installed and set up correctly. In + such a case, no additional steps are required. However, if you installed by other means, such as with + tarballs, native checksumming may not be available due to missing shared objects. Finding the message + "<codeph>Unable to load native-hadoop library for your platform... using builtin-java classes where + applicable</codeph>" in the Impala logs indicates that native checksumming may be unavailable. To enable native + checksumming, you must build and install <codeph>libhadoop.so</codeph> (the + <!-- Another instance of stale link. --> + <!-- <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>). --> + Hadoop Native Library). + </p> + </section> + </conbody> +</concept>
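Before restarting the DataNodes, it can help to read the settings back from Impala's copy of the client configuration. The following is a minimal sketch, not part of the topic above, assuming the hdfs command-line tool is installed and that /etc/impala/conf is the Impala configuration directory (the default used throughout the topic):

  # Read each property back from Impala's copy of the client configuration.
  $ HADOOP_CONF_DIR=/etc/impala/conf hdfs getconf -confKey dfs.client.read.shortcircuit
  $ HADOOP_CONF_DIR=/etc/impala/conf hdfs getconf -confKey dfs.domain.socket.path
  $ HADOOP_CONF_DIR=/etc/impala/conf hdfs getconf -confKey dfs.datanode.hdfs-blocks-metadata.enabled

  # The domain socket directory must exist; if it is group-writable,
  # its group should be root, as noted in the topic above.
  $ ls -ld /var/run/hdfs-sockets

Each command should print the value configured above (true, the socket path, and true, respectively); an empty or default value usually means the copied hdfs-site.xml is not the one being picked up.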
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/1fcc8cee/docs/topics/impala_connecting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_connecting.xml b/docs/topics/impala_connecting.xml new file mode 100644 index 0000000..354e698 --- /dev/null +++ b/docs/topics/impala_connecting.xml @@ -0,0 +1,202 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="connecting"> + + <title>Connecting to impalad through impala-shell</title> + <titlealts audience="PDF"><navtitle>Connecting to impalad</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="impala-shell"/> + <data name="Category" value="Network"/> + <data name="Category" value="DataNode"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + +<!-- +TK: This would be a good theme for a tutorial topic. +Lots of nuances to illustrate through sample code. +--> + + <p> + Within an <cmdname>impala-shell</cmdname> session, you can only issue queries while connected to an instance + of the <cmdname>impalad</cmdname> daemon. You can specify the connection information: + <ul> + <li> + Through command-line options when you run the <cmdname>impala-shell</cmdname> command. + </li> + <li> + Through a configuration file that is read when you run the <cmdname>impala-shell</cmdname> command. + </li> + <li> + During an <cmdname>impala-shell</cmdname> session, by issuing a <codeph>CONNECT</codeph> command. + </li> + </ul> + See <xref href="impala_shell_options.xml"/> for the command-line and configuration file options you can use. + </p> + + <p> + You can connect to any DataNode where an instance of <cmdname>impalad</cmdname> is running, + and that host coordinates the execution of all queries sent to it. + </p> + + <p> + For simplicity during development, you might always connect to the same host, perhaps running <cmdname>impala-shell</cmdname> on + the same host as <cmdname>impalad</cmdname> and specifying the hostname as <codeph>localhost</codeph>. + </p> + + <p> + In a production environment, you might enable load balancing, in which you connect to specific host/port combination + but queries are forwarded to arbitrary hosts. This technique spreads the overhead of acting as the coordinator + node among all the DataNodes in the cluster. See <xref href="impala_proxy.xml"/> for details. + </p> + + <p> + <b>To connect the Impala shell during shell startup:</b> + </p> + + <ol> + <li> + Locate the hostname of a DataNode within the cluster that is running an instance of the + <cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something + other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the + port number also. + </li> + + <li> + Use the <codeph>-i</codeph> option to the + <cmdname>impala-shell</cmdname> interpreter to specify the connection information for + that instance of <cmdname>impalad</cmdname>: +<codeblock> +# When you are logged into the same machine running impalad. +# The prompt will reflect the current hostname. +$ impala-shell + +# When you are logged into the same machine running impalad. +# The host will reflect the hostname 'localhost'. +$ impala-shell -i localhost + +# When you are logged onto a different host, perhaps a client machine +# outside the Hadoop cluster. 
+$ impala-shell -i <varname>some.other.hostname</varname> + +# When you are logged onto a different host, and impalad is listening +# on a non-default port. Perhaps a load balancer is forwarding requests +# to a different host/port combination behind the scenes. +$ impala-shell -i <varname>some.other.hostname</varname>:<varname>port_number</varname> +</codeblock> + </li> + </ol> + + <p> + <b>To connect the Impala shell after shell startup:</b> + </p> + + <ol> + <li> + Start the Impala shell with no connection: +<codeblock>$ impala-shell</codeblock> + <p> + You should see a prompt like the following: + </p> +<codeblock>Welcome to the Impala shell. Press TAB twice to see a list of available commands. + +Copyright (c) <varname>year</varname> Cloudera, Inc. All rights reserved. + +<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> +[Not connected] > </codeblock> + </li> + + <li> + Locate the hostname of a DataNode within the cluster that is running an instance of the + <cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something + other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the + port number also. + </li> + + <li> + Use the <codeph>connect</codeph> command to connect to an Impala instance. Enter a command of the form: +<codeblock>[Not connected] > connect <varname>impalad-host</varname> +[<varname>impalad-host</varname>:21000] ></codeblock> + <note> + Replace <varname>impalad-host</varname> with the hostname you have configured for any DataNode running + Impala in your environment. The changed prompt indicates a successful connection. + </note> + </li> + </ol> + + <p> + <b>To start <cmdname>impala-shell</cmdname> in a specific database:</b> + </p> + + <p> + You can use all the same connection options as in previous examples. + For simplicity, these examples assume that you are logged into one of + the DataNodes that is running the <cmdname>impalad</cmdname> daemon. + </p> + + <ol> + <li> + Find the name of the database containing the relevant tables, views, and so + on that you want to operate on. + </li> + + <li> + Use the <codeph>-d</codeph> option to the + <cmdname>impala-shell</cmdname> interpreter to connect and immediately + switch to the specified database, without the need for a <codeph>USE</codeph> + statement or fully qualified names: +<codeblock> +# Subsequent queries with unqualified names operate on +# tables, views, and so on inside the database named 'staging'. +$ impala-shell -i localhost -d staging + +# It is common during development, ETL, benchmarking, and so on +# to have different databases containing the same table names +# but with different contents or layouts. +$ impala-shell -i localhost -d parquet_snappy_compression +$ impala-shell -i localhost -d parquet_gzip_compression +</codeblock> + </li> + </ol> + + <p> + <b>To run one or several statements in non-interactive mode:</b> + </p> + + <p> + You can use all the same connection options as in previous examples. + For simplicity, these examples assume that you are logged into one of + the DataNodes that is running the <cmdname>impalad</cmdname> daemon. + </p> + + <ol> + <li> + Construct a statement, or a file containing a sequence of statements, + that you want to run in an automated way, without typing or copying + and pasting each time. + </li> + + <li> + Invoke <cmdname>impala-shell</cmdname> with the <codeph>-q</codeph> option to run a single statement, or + the <codeph>-f</codeph> option to run a sequence of statements from a file. 
+ The <cmdname>impala-shell</cmdname> command returns immediately, without going into + the interactive interpreter. +<codeblock> +# A utility command that you might run while developing shell scripts +# to manipulate HDFS files. +$ impala-shell -i localhost -d database_of_interest -q 'show tables' + +# A sequence of CREATE TABLE, CREATE VIEW, and similar DDL statements +# can go into a file to make the setup process repeatable. +$ impala-shell -i localhost -d database_of_interest -f recreate_tables.sql +</codeblock> + </li> + </ol> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/1fcc8cee/docs/topics/impala_delegation.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_delegation.xml b/docs/topics/impala_delegation.xml new file mode 100644 index 0000000..0d59761 --- /dev/null +++ b/docs/topics/impala_delegation.xml @@ -0,0 +1,88 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2" id="delegation"> + + <title>Configuring Impala Delegation for Hue and BI Tools</title> + + <prolog> + <metadata> + <data name="Category" value="Security"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Authentication"/> + <data name="Category" value="Delegation"/> + <data name="Category" value="Hue"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> +<!-- + When users connect to Impala directly through the <cmdname>impala-shell</cmdname> interpreter, the Sentry + authorization framework determines what actions they can take and what data they can see. +--> + When users submit Impala queries through a separate application, such as Hue or a business intelligence tool, + typically all requests are treated as coming from the same user. In Impala 1.2 and higher, authentication is + extended by a new feature that allows applications to pass along credentials for the users that connect to + them (known as <q>delegation</q>), and issue Impala queries with the privileges for those users. Currently, + the delegation feature is available only for Impala queries submitted through application interfaces such as + Hue and BI tools; for example, Impala cannot issue queries using the privileges of the HDFS user. + </p> + + <p> + The delegation feature is enabled by a startup option for <cmdname>impalad</cmdname>: + <codeph>--authorized_proxy_user_config</codeph>. When you specify this option, users whose names you specify + (such as <codeph>hue</codeph>) can delegate the execution of a query to another user. The query runs with the + privileges of the delegated user, not the original user such as <codeph>hue</codeph>. The name of the + delegated user is passed using the HiveServer2 configuration property <codeph>impala.doas.user</codeph>. + </p> + + <p> + You can specify a list of users that the application user can delegate to, or <codeph>*</codeph> to allow a + superuser to delegate to any other user. For example: + </p> + +<codeblock>impalad --authorized_proxy_user_config 'hue=user1,user2;admin=*' ...</codeblock> + + <note> + Make sure to use single quotes or escape characters to ensure that any <codeph>*</codeph> characters do not + undergo wildcard expansion when specified in command-line arguments. 
+ </note> + + <p> + See <xref href="impala_config_options.xml#config_options"/> for details about adding or changing + <cmdname>impalad</cmdname> startup options. See + <xref href="http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/" scope="external" format="html">this + Cloudera blog post</xref> for background information about the delegation capability in HiveServer2. + </p> + + <p> + To set up authentication for the delegated users: + </p> + + <ul> + <li> + <p> + On the server side, configure either user/password authentication through LDAP, or Kerberos + authentication, for all the delegated users. See <xref href="impala_ldap.xml#ldap"/> or + <xref href="impala_kerberos.xml#kerberos"/> for details. + </p> + </li> + + <li> + <p> + On the client side, follow the instructions in the <q>Using User Name and Password</q> section in the + <xref href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/Connectors/PDF/Cloudera-ODBC-Driver-for-Impala-Install-Guide.pdf" scope="external" format="pdf">ODBC + driver installation guide</xref>. Then search for <q>delegation</q> in that same installation guide to + learn about the <uicontrol>Delegation UID</uicontrol> field and <codeph>DelegationUID</codeph> configuration keyword to enable the delegation feature for + ODBC-based BI tools. + </p> + </li> + </ul> + + </conbody> + +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/1fcc8cee/docs/topics/impala_development.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_development.xml b/docs/topics/impala_development.xml new file mode 100644 index 0000000..a2eef16 --- /dev/null +++ b/docs/topics/impala_development.xml @@ -0,0 +1,229 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="intro_dev"> + + <title>Developing Impala Applications</title> + <titlealts audience="PDF"><navtitle>Developing Applications</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The core development language with Impala is SQL. You can also use Java or other languages to interact with + Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For + specialized kinds of analysis, you can supplement the SQL built-in functions by writing + <xref href="impala_udf.xml#udfs">user-defined functions (UDFs)</xref> in C++ or Java. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="intro_sql"> + + <title>Overview of the Impala SQL Dialect</title> + <prolog> + <metadata> + <data name="Category" value="SQL"/> + <data name="Category" value="Concepts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL). As + such, it is familiar to users who are already familiar with running SQL queries on the Hadoop + infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in + functions. Impala also includes additional built-in functions for common industry features, to simplify + porting SQL from non-Hadoop systems. 
+ </p> + + <p> + For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect + might seem familiar: + </p> + + <ul> + <li> + <p> + The <codeph>SELECT</codeph> statement includes familiar clauses such as <codeph>WHERE</codeph>, + <codeph>GROUP BY</codeph>, <codeph>ORDER BY</codeph>, and <codeph>WITH</codeph>. + You will find familiar notions such as + <xref href="impala_joins.xml#joins">joins</xref>, <xref href="impala_functions.xml#builtins">built-in + functions</xref> for processing strings, numbers, and dates, + <xref href="impala_aggregate_functions.xml#aggregate_functions">aggregate functions</xref>, + <xref href="impala_subqueries.xml#subqueries">subqueries</xref>, and + <xref href="impala_operators.xml#comparison_operators">comparison operators</xref> + such as <codeph>IN()</codeph> and <codeph>BETWEEN</codeph>. + The <codeph>SELECT</codeph> statement is the place where SQL standards compliance is most important. + </p> + </li> + + <li> + <p> + From the data warehousing world, you will recognize the notion of + <xref href="impala_partitioning.xml#partitioning">partitioned tables</xref>. + One or more columns serve as partition keys, and the data is physically arranged so that + queries that refer to the partition key columns in the <codeph>WHERE</codeph> clause + can skip partitions that do not match the filter conditions. For example, if you have 10 + years worth of data and use a clause such as <codeph>WHERE year = 2015</codeph>, + <codeph>WHERE year > 2010</codeph>, or <codeph>WHERE year IN (2014, 2015)</codeph>, + Impala skips all the data for non-matching years, greatly reducing the amount of I/O + for the query. + </p> + </li> + + <li rev="1.2"> + <p> + In Impala 1.2 and higher, <xref href="impala_udf.xml#udfs">UDFs</xref> let you perform custom comparisons + and transformation logic during <codeph>SELECT</codeph> and <codeph>INSERT...SELECT</codeph> statements. + </p> + </li> + </ul> + + <p> + For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect + might require some learning and practice for you to become proficient in the Hadoop environment: + </p> + + <ul> + <li> + <p> + Impala SQL is focused on queries and includes relatively little DML. There is no <codeph>UPDATE</codeph> + or <codeph>DELETE</codeph> statement. Stale data is typically discarded (by <codeph>DROP TABLE</codeph> + or <codeph>ALTER TABLE ... DROP PARTITION</codeph> statements) or replaced (by <codeph>INSERT + OVERWRITE</codeph> statements). + </p> + </li> + + <li> + <p> + All data creation is done by <codeph>INSERT</codeph> statements, which typically insert data in bulk by + querying from other tables. There are two variations, <codeph>INSERT INTO</codeph> which appends to the + existing data, and <codeph>INSERT OVERWRITE</codeph> which replaces the entire contents of a table or + partition (similar to <codeph>TRUNCATE TABLE</codeph> followed by a new <codeph>INSERT</codeph>). There + is no <codeph>INSERT ... VALUES</codeph> syntax to insert a single row. + </p> + </li> + + <li> + <p> + You often construct Impala table definitions and data files in some other environment, and then attach + Impala so that it can run real-time queries. The same data files and table metadata are shared with other + components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data + inserted by Hive, and Hive can access tables and data produced by Impala. 
Many other Hadoop components + can write files in formats such as Parquet and Avro, that can then be queried by Impala. + </p> + </li> + + <li> + <p> + Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL + includes some idioms that you might find in the import utilities for traditional database systems. For + example, you can create a table that reads comma-separated or tab-separated text files, specifying the + separator in the <codeph>CREATE TABLE</codeph> statement. You can create <b>external tables</b> that read + existing data files but do not move or transform them. + </p> + </li> + + <li> + <p> + Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does + not impose length constraints on string data types. For example, you can define a database column as + <codeph>STRING</codeph> with unlimited length, rather than <codeph>CHAR(1)</codeph> or + <codeph>VARCHAR(64)</codeph>. <ph rev="2.0.0">(Although in Impala 2.0 and later, you can also use + length-constrained <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types.)</ph> + </p> + </li> + + </ul> + + <p> + <b>Related information:</b> <xref href="impala_langref.xml#langref"/>, especially + <xref href="impala_langref_sql.xml#langref_sql"/> and <xref href="impala_functions.xml#builtins"/> + </p> + </conbody> + </concept> + +<!-- Bunch of potential concept topics for future consideration. Major areas of Impala modelled on areas of discussion for Oracle Database, and distributed databases in general. --> + + <concept id="intro_datatypes" audience="Cloudera"> + + <title>Overview of Impala SQL Data Types</title> + + <conbody/> + </concept> + + <concept id="intro_network" audience="Cloudera"> + + <title>Overview of Impala Network Topology</title> + + <conbody/> + </concept> + + <concept id="intro_cluster" audience="Cloudera"> + + <title>Overview of Impala Cluster Topology</title> + + <conbody/> + </concept> + + <concept id="intro_apis"> + + <title>Overview of Impala Programming Interfaces</title> + <prolog> + <metadata> + <data name="Category" value="JDBC"/> + <data name="Category" value="ODBC"/> + <data name="Category" value="Hue"/> + </metadata> + </prolog> + + <conbody> + + <p> + You can connect and submit requests to the Impala daemons through: + </p> + + <ul> + <li> + The <codeph><xref href="impala_impala_shell.xml#impala_shell">impala-shell</xref></codeph> interactive + command interpreter. + </li> + + <li> + The <xref href="http://gethue.com/" scope="external" format="html">Hue</xref> web-based user interface. + </li> + + <li> + <xref href="impala_jdbc.xml#impala_jdbc">JDBC</xref>. + </li> + + <li> + <xref href="impala_odbc.xml#impala_odbc">ODBC</xref>. + </li> + </ul> + + <p> + With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications + running on non-Linux platforms. You can also use Impala on combination with various Business Intelligence + tools that use the JDBC and ODBC interfaces. + </p> + + <p> + Each <codeph>impalad</codeph> daemon process, running on separate nodes in a cluster, listens to + <xref href="impala_ports.xml#ports">several ports</xref> for incoming requests. Requests from + <codeph>impala-shell</codeph> and Hue are routed to the <codeph>impalad</codeph> daemons through the same + port. The <codeph>impalad</codeph> daemons listen on separate ports for JDBC and ODBC requests. + </p> + </conbody> + </concept> +</concept>