http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_config_options.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_config_options.xml b/docs/topics/impala_config_options.xml new file mode 100644 index 0000000..a08aafa --- /dev/null +++ b/docs/topics/impala_config_options.xml @@ -0,0 +1,593 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="config_options"> + + <title>Modifying Impala Startup Options</title> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">defaults file</indexterm> + + <indexterm audience="Cloudera">configuration file</indexterm> + + <indexterm audience="Cloudera">options</indexterm> + + <indexterm audience="Cloudera">IMPALA_STATE_STORE_PORT</indexterm> + + <indexterm audience="Cloudera">IMPALA_BACKEND_PORT</indexterm> + + <indexterm audience="Cloudera">IMPALA_LOG_DIR</indexterm> + + <indexterm audience="Cloudera">IMPALA_STATE_STORE_ARGS</indexterm> + + <indexterm audience="Cloudera">IMPALA_SERVER_ARGS</indexterm> + + <indexterm audience="Cloudera">ENABLE_CORE_DUMPS</indexterm> + + <indexterm audience="Cloudera">core dumps</indexterm> + + <indexterm audience="Cloudera">restarting services</indexterm> + + <indexterm audience="Cloudera">services</indexterm> + The configuration options for the Impala-related daemons let you choose which hosts and + ports to use for the services that run on a single host, specify directories for logging, + control resource usage and security, and specify other aspects of the Impala software. 
+ </p> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="config_options_cm"> + + <title>Configuring Impala Startup Options through Cloudera Manager</title> + + <conbody> + + <p> + If you manage your cluster through Cloudera Manager, configure the settings for all the + Impala-related daemons by navigating to this page: + <menucascade><uicontrol>Clusters</uicontrol><uicontrol>Impala</uicontrol><uicontrol>Configuration</uicontrol><uicontrol>View + and Edit</uicontrol></menucascade>. See the Cloudera Manager documentation for + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_impala_service.html" scope="external" format="html">instructions + about how to configure Impala through Cloudera Manager</xref>. + </p> + + <p> + If the Cloudera Manager interface does not yet have a form field for a newly added + option, or if you need to use special options for debugging and troubleshooting, the + <uicontrol>Advanced</uicontrol> option page for each daemon includes one or more fields + where you can enter option names directly. + <ph conref="../shared/impala_common.xml#common/safety_valve"/> There is also a free-form + field for query options, on the top-level <uicontrol>Impala Daemon</uicontrol> options + page. + </p> + + </conbody> + + </concept> + + <concept id="config_options_noncm"> + + <title>Configuring Impala Startup Options through the Command Line</title> + + <conbody> + + <p> + When you run Impala in a non-Cloudera Manager environment, the Impala server, + statestore, and catalog services start up using values provided in a defaults file, + <filepath>/etc/default/impala</filepath>. + </p> + + <p> + This file includes information about many resources used by Impala. The defaults + in this file are appropriate for most deployments. For example, typically you + would not change the definition of the <codeph>CLASSPATH</codeph> variable, but you + would always set the address used by the statestore server. 
Some of the content you + might modify includes: + </p> + +<!-- Note: Update the following example for each release with the associated lines from /etc/default/impala + from a non-CM-managed system. --> + +<codeblock rev="ver">IMPALA_STATE_STORE_HOST=127.0.0.1 +IMPALA_STATE_STORE_PORT=24000 +IMPALA_BACKEND_PORT=22000 +IMPALA_LOG_DIR=/var/log/impala +IMPALA_CATALOG_SERVICE_HOST=... +IMPALA_STATE_STORE_HOST=... + +export IMPALA_STATE_STORE_ARGS=${IMPALA_STATE_STORE_ARGS:- \ + -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}} +IMPALA_SERVER_ARGS=" \ +-log_dir=${IMPALA_LOG_DIR} \ +-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \ +-state_store_port=${IMPALA_STATE_STORE_PORT} \ +-use_statestore \ +-state_store_host=${IMPALA_STATE_STORE_HOST} \ +-be_port=${IMPALA_BACKEND_PORT}" +export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock> + + <p> + To use alternate values, edit the defaults file, then restart all the Impala-related + services so that the changes take effect. Restart the Impala server using the following + commands: + </p> + +<codeblock>$ sudo service impala-server restart +Stopping Impala Server: [ OK ] +Starting Impala Server: [ OK ]</codeblock> + + <p> + Restart the Impala statestore using the following commands: + </p> + +<codeblock>$ sudo service impala-state-store restart +Stopping Impala State Store Server: [ OK ] +Starting Impala State Store Server: [ OK ]</codeblock> + + <p> + Restart the Impala catalog service using the following commands: + </p> + +<codeblock>$ sudo service impala-catalog restart +Stopping Impala Catalog Server: [ OK ] +Starting Impala Catalog Server: [ OK ]</codeblock> + + <p> + Some common settings to change include: + </p> + + <ul> + <li> + <p> + Statestore address. Where practical, put the statestore on a separate host not + running the <cmdname>impalad</cmdname> daemon. 
In that recommended configuration, + the <cmdname>impalad</cmdname> daemon cannot refer to the statestore server using + the loopback address. If the statestore is hosted on a machine with an IP address of + 192.168.0.27, change: + </p> +<codeblock>IMPALA_STATE_STORE_HOST=127.0.0.1</codeblock> + <p> + to: + </p> +<codeblock>IMPALA_STATE_STORE_HOST=192.168.0.27</codeblock> + </li> + + <li rev="1.2"> + <p> + Catalog server address (including both the hostname and the port number). Update the + value of the <codeph>IMPALA_CATALOG_SERVICE_HOST</codeph> variable. Cloudera + recommends the catalog server be on the same host as the statestore. In that + recommended configuration, the <cmdname>impalad</cmdname> daemon cannot refer to the + catalog server using the loopback address. If the catalog service is hosted on a + machine with an IP address of 192.168.0.27, add the following line: + </p> +<codeblock>IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000</codeblock> + <p> + The <filepath>/etc/default/impala</filepath> defaults file currently does not define + an <codeph>IMPALA_CATALOG_ARGS</codeph> environment variable, but if you add one it + will be recognized by the service startup/shutdown script. Add a definition for this + variable to <filepath>/etc/default/impala</filepath> and add the option + <codeph>-catalog_service_host=<varname>hostname</varname></codeph>. If the port is + different than the default 26000, also add the option + <codeph>-catalog_service_port=<varname>port</varname></codeph>. + </p> + </li> + + <li id="mem_limit"> + <p> + Memory limits. You can limit the amount of memory available to Impala. For example, + to allow Impala to use no more than 70% of system memory, change: + </p> +<!-- Note: also needs to be updated for each release to reflect latest /etc/default/impala. 
--> +<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \ + -log_dir=${IMPALA_LOG_DIR} \ + -state_store_port=${IMPALA_STATE_STORE_PORT} \ + -use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \ + -be_port=${IMPALA_BACKEND_PORT}}</codeblock> + <p> + to: + </p> +<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \ + -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \ + -use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \ + -be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}</codeblock> + <p> + You can specify the memory limit using absolute notation such as + <codeph>500m</codeph> or <codeph>2G</codeph>, or as a percentage of physical memory + such as <codeph>60%</codeph>. + </p> + + <note> + Queries that exceed the specified memory limit are aborted. Percentage limits are + based on the physical memory of the machine and do not consider cgroups. + </note> + </li> + + <li> + <p> + Core dump enablement. To enable core dumps on systems not managed by Cloudera + Manager, change: + </p> +<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock> + <p> + to: + </p> +<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}</codeblock> + <p> + On systems managed by Cloudera Manager, enable the <uicontrol>Enable Core + Dump</uicontrol> setting for the Impala service. + </p> + + <note conref="../shared/impala_common.xml#common/core_dump_considerations"/> + </li> + + <li> + <p> + Authorization using the open source Sentry plugin. Specify the + <codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph> + options as part of the <codeph>IMPALA_SERVER_ARGS</codeph> and + <codeph>IMPALA_STATE_STORE_ARGS</codeph> settings to enable the core Impala support + for authentication. See <xref href="impala_authorization.xml#secure_startup"/> for + details. + </p> + </li> + + <li> + <p> + Auditing for successful or blocked Impala queries, another aspect of security. 
+ Specify the <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph> + option and optionally the + <codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph> + and <codeph>-abort_on_failed_audit_event</codeph> options as part of the + <codeph>IMPALA_SERVER_ARGS</codeph> settings, for each Impala node, to enable and + customize auditing. See <xref href="impala_auditing.xml#auditing"/> for details. + </p> + </li> + + <li> + <p> + Password protection for the Impala web UI, which listens on port 25000 by default. + This feature involves adding some or all of the + <codeph>--webserver_password_file</codeph>, + <codeph>--webserver_authentication_domain</codeph>, and + <codeph>--webserver_certificate_file</codeph> options to the + <codeph>IMPALA_SERVER_ARGS</codeph> and <codeph>IMPALA_STATE_STORE_ARGS</codeph> + settings. See <xref href="impala_security_guidelines.xml#security_guidelines"/> for + details. + </p> + </li> + + <li id="default_query_options"> + <p rev="DOCS-677"> + Another setting you might add to <codeph>IMPALA_SERVER_ARGS</codeph> is a + comma-separated list of query options and values: +<codeblock>-default_query_options='<varname>option</varname>=<varname>value</varname>,<varname>option</varname>=<varname>value</varname>,...' +</codeblock> + These options control the behavior of queries performed by this + <cmdname>impalad</cmdname> instance. The option values you specify here override the + default values for <xref href="impala_query_options.xml#query_options">Impala query + options</xref>, as shown by the <codeph>SET</codeph> statement in + <cmdname>impala-shell</cmdname>. + </p> + </li> + +<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up. + <li rev="1.2"> + <p> + Options for resource management, in conjunction with the YARN component. These options include + <codeph>-enable_rm</codeph> and <codeph>-cgroup_hierarchy_path</codeph>. 
+ <ph rev="1.4.0">Additional options to help fine-tune the resource estimates are + <codeph>-rm_always_use_defaults</codeph>, + <codeph>-rm_default_memory=<varname>size</varname></codeph>, and + <codeph>-rm_default_cpu_cores</codeph>.</ph> For details about these options, see + <xref href="impala_resource_management.xml#rm_options"/>. See + <xref href="impala_resource_management.xml#resource_management"/> for information about resource + management in general. + </p> + </li> +--> + + <li> + <p> + During troubleshooting, <keyword keyref="support_org"/> might direct you to change other values, + particularly for <codeph>IMPALA_SERVER_ARGS</codeph>, to work around issues or + gather debugging information. + </p> + </li> + </ul> + +<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up. + <p conref="impala_resource_management.xml#rm_options/resource_management_impalad_options"/> +--> + + <note> + <p> + These startup options for the <cmdname>impalad</cmdname> daemon are different from the + command-line options for the <cmdname>impala-shell</cmdname> command. For the + <cmdname>impala-shell</cmdname> options, see + <xref href="impala_shell_options.xml#shell_options"/>. + </p> + </note> + + <p audience="Cloudera" outputclass="toc inpage"/> + + </conbody> + + <concept audience="Cloudera" id="config_options_impalad_details"> + + <title>Configuration Options for impalad Daemon</title> + + <conbody> + + <p> + Some common settings to change include: + </p> + + <ul> + <li> + <p> + Statestore address. Where practical, put the statestore on a separate host not + running the <cmdname>impalad</cmdname> daemon. In that recommended configuration, + the <cmdname>impalad</cmdname> daemon cannot refer to the statestore server using + the loopback address. 
If the statestore is hosted on a machine with an IP address + of 192.168.0.27, change: + </p> +<codeblock>IMPALA_STATE_STORE_HOST=127.0.0.1</codeblock> + <p> + to: + </p> +<codeblock>IMPALA_STATE_STORE_HOST=192.168.0.27</codeblock> + </li> + + <li rev="1.2"> + <p> + Catalog server address. Update the <codeph>IMPALA_CATALOG_SERVICE_HOST</codeph> + variable, including both the hostname and the port number in the value. Cloudera + recommends the catalog server be on the same host as the statestore. In that + recommended configuration, the <cmdname>impalad</cmdname> daemon cannot refer to + the catalog server using the loopback address. If the catalog service is hosted on + a machine with an IP address of 192.168.0.27, add the following line: + </p> +<codeblock>IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000</codeblock> + <p> + The <filepath>/etc/default/impala</filepath> defaults file currently does not + define an <codeph>IMPALA_CATALOG_ARGS</codeph> environment variable, but if you + add one it will be recognized by the service startup/shutdown script. Add a + definition for this variable to <filepath>/etc/default/impala</filepath> and add + the option <codeph>-catalog_service_host=<varname>hostname</varname></codeph>. If + the port is different than the default 26000, also add the option + <codeph>-catalog_service_port=<varname>port</varname></codeph>. + </p> + </li> + + <li id="mem_limit"> + Memory limits. You can limit the amount of memory available to Impala. For example, + to allow Impala to use no more than 70% of system memory, change: +<!-- Note: also needs to be updated for each release to reflect latest /etc/default/impala. 
--> +<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \ + -log_dir=${IMPALA_LOG_DIR} \ + -state_store_port=${IMPALA_STATE_STORE_PORT} \ + -use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \ + -be_port=${IMPALA_BACKEND_PORT}}</codeblock> + <p> + to: + </p> +<codeblock>export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \ + -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \ + -use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \ + -be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}</codeblock> + <p> + You can specify the memory limit using absolute notation such as + <codeph>500m</codeph> or <codeph>2G</codeph>, or as a percentage of physical + memory such as <codeph>60%</codeph>. + </p> + + <note> + Queries that exceed the specified memory limit are aborted. Percentage limits are + based on the physical memory of the machine and do not consider cgroups. + </note> + </li> + + <li> + Core dump enablement. To enable core dumps, change: +<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}</codeblock> + <p> + to: + </p> +<codeblock>export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}</codeblock> + <note> + The location of core dump files may vary according to your operating system + configuration. Other security settings may prevent Impala from writing core dumps + even when this option is enabled. + </note> + </li> + + <li> + Authorization using the open source Sentry plugin. Specify the + <codeph>-server_name</codeph> and <codeph>-authorization_policy_file</codeph> + options as part of the <codeph>IMPALA_SERVER_ARGS</codeph> and + <codeph>IMPALA_STATE_STORE_ARGS</codeph> settings to enable the core Impala support + for authentication. See <xref href="impala_authorization.xml#secure_startup"/> for + details. + </li> + + <li> + Auditing for successful or blocked Impala queries, another aspect of security. 
+ Specify the <codeph>-audit_event_log_dir=<varname>directory_path</varname></codeph> + option and optionally the + <codeph>-max_audit_event_log_file_size=<varname>number_of_queries</varname></codeph> + and <codeph>-abort_on_failed_audit_event</codeph> options as part of the + <codeph>IMPALA_SERVER_ARGS</codeph> settings, for each Impala node, to enable and + customize auditing. See <xref href="impala_auditing.xml#auditing"/> for details. + </li> + + <li> + Password protection for the Impala web UI, which listens on port 25000 by default. + This feature involves adding some or all of the + <codeph>--webserver_password_file</codeph>, + <codeph>--webserver_authentication_domain</codeph>, and + <codeph>--webserver_certificate_file</codeph> options to the + <codeph>IMPALA_SERVER_ARGS</codeph> and <codeph>IMPALA_STATE_STORE_ARGS</codeph> + settings. See <xref href="impala_security_webui.xml"/> for details. + </li> + + <li id="default_query_options"> + Another setting you might add to <codeph>IMPALA_SERVER_ARGS</codeph> is: +<codeblock>-default_query_options='<varname>option</varname>=<varname>value</varname>,<varname>option</varname>=<varname>value</varname>,...' +</codeblock> + These options control the behavior of queries performed by this + <cmdname>impalad</cmdname> instance. The option values you specify here override the + default values for <xref href="impala_query_options.xml#query_options">Impala query + options</xref>, as shown by the <codeph>SET</codeph> statement in + <cmdname>impala-shell</cmdname>. + </li> + +<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up. + <li rev="1.2"> + Options for resource management, in conjunction with the YARN component. These options + include <codeph>-enable_rm</codeph> and <codeph>-cgroup_hierarchy_path</codeph>. 
+ <ph rev="1.4.0">Additional options to help fine-tune the resource estimates are + <codeph>-rm_always_use_defaults</codeph>, + <codeph>-rm_default_memory=<varname>size</varname></codeph>, and + <codeph>-rm_default_cpu_cores</codeph>.</ph> For details about these options, see + <xref href="impala_resource_management.xml#rm_options"/>. See + <xref href="impala_resource_management.xml#resource_management"/> for information about resource + management in general. + </li> +--> + + <li> + During troubleshooting, <keyword keyref="support_org"/> might direct you to change other values, + particularly for <codeph>IMPALA_SERVER_ARGS</codeph>, to work around issues or + gather debugging information. + </li> + </ul> + +<!-- Removing this reference now that the options are de-emphasized / desupported in CDH 5.5 / Impala 2.3 and up. + <p conref="impala_resource_management.xml#rm_options/resource_management_impalad_options"/> +--> + + <note> + <p> + These startup options for the <cmdname>impalad</cmdname> daemon are different from + the command-line options for the <cmdname>impala-shell</cmdname> command. For the + <cmdname>impala-shell</cmdname> options, see + <xref href="impala_shell_options.xml#shell_options"/>. 
+ </p> + </note> + + </conbody> + + </concept> + + <concept audience="Cloudera" id="config_options_statestored_details"> + + <title>Configuration Options for statestored Daemon</title> + + <conbody> + + <p></p> + + </conbody> + + </concept> + + <concept audience="Cloudera" id="config_options_catalogd_details"> + + <title>Configuration Options for catalogd Daemon</title> + + <conbody> + + <p></p> + + </conbody> + + </concept> + + </concept> + + <concept id="config_options_checking"> + + <title>Checking the Values of Impala Configuration Options</title> + + <conbody> + + <p> + You can check the current runtime value of all these settings through the Impala web + interface, available by default at + <codeph>http://<varname>impala_hostname</varname>:25000/varz</codeph> for the + <cmdname>impalad</cmdname> daemon, + <codeph>http://<varname>impala_hostname</varname>:25010/varz</codeph> for the + <cmdname>statestored</cmdname> daemon, or + <codeph>http://<varname>impala_hostname</varname>:25020/varz</codeph> for the + <cmdname>catalogd</cmdname> daemon. In the Cloudera Manager interface, you can see the + link to the appropriate <uicontrol><varname>service_name</varname> Web UI</uicontrol> + page when you look at the status page for a specific daemon on a specific host. + </p> + + </conbody> + + </concept> + + <concept id="config_options_impalad"> + + <title>Startup Options for impalad Daemon</title> + + <conbody> + + <p> + The <codeph>impalad</codeph> daemon implements the main Impala service, which performs + query processing and reads and writes the data files. + </p> + + </conbody> + + </concept> + + <concept id="config_options_statestored"> + + <title>Startup Options for statestored Daemon</title> + + <conbody> + + <p> + The <cmdname>statestored</cmdname> daemon implements the Impala statestore service, + which monitors the availability of Impala services across the cluster, and handles + situations such as nodes becoming unavailable or becoming available again. 
+ </p> + + </conbody> + + </concept> + + <concept rev="1.2" id="config_options_catalogd"> + + <title>Startup Options for catalogd Daemon</title> + + <conbody> + + <p> + The <cmdname>catalogd</cmdname> daemon implements the Impala catalog service, which + broadcasts metadata changes to all the Impala nodes when Impala creates a table, inserts + data, or performs other kinds of DDL and DML operations. + </p> + + <p conref="../shared/impala_common.xml#common/load_catalog_in_background"/> + + </conbody> + + </concept> + +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_config_performance.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_config_performance.xml b/docs/topics/impala_config_performance.xml new file mode 100644 index 0000000..0a44a33 --- /dev/null +++ b/docs/topics/impala_config_performance.xml @@ -0,0 +1,179 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="config_performance"> + + <title>Post-Installation Configuration for Impala</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + <data name="Category" value="Impala"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> + + <conbody> + + <p id="p_24"> + This section describes the mandatory and recommended configuration settings for Impala. If Impala is + installed using Cloudera Manager, some of these configurations are completed automatically; you must still + configure short-circuit reads manually. If you installed Impala without Cloudera Manager, or if you want to + customize your environment, consider making the changes described in this topic. + </p> + + <p> +<!-- Could conref this paragraph from ciiu_install.xml. --> + In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular + component configuration details in one of the free-form fields on the Impala configuration pages within + Cloudera Manager. <ph conref="../shared/impala_common.xml#common/safety_valve"/> + </p> + + <ul> + <li> + You must enable short-circuit reads, whether or not Impala was installed through Cloudera Manager. This + setting goes in the Impala configuration settings, not the Hadoop-wide settings. 
+ </li> + + <li> + If you installed Impala in an environment that is not managed by Cloudera Manager, you must enable block + location tracking, and you can optionally enable native checksumming for optimal performance. + </li> + + <li> + If you deployed Impala using Cloudera Manager, see + <xref href="impala_perf_testing.xml#performance_testing"/> to confirm proper configuration. + </li> + </ul> + + <section id="section_fhq_wyv_ls"> + <title>Mandatory: Short-Circuit Reads</title> + <p> Enabling short-circuit reads allows Impala to read local data directly + from the file system. This removes the need to communicate through the + DataNodes, improving performance. This setting also minimizes the number + of additional copies of data. Short-circuit reads require + <codeph>libhadoop.so</codeph> + <!-- This link went stale. Not obvious how to keep it in sync with whatever Hadoop CDH is using behind the scenes. So hide the link for now. --> + <!-- (the <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>) --> + (the Hadoop Native Library) to be accessible to both the server and the + client. <codeph>libhadoop.so</codeph> is not available if you have + installed from a tarball. You must install from an + <codeph>.rpm</codeph>, <codeph>.deb</codeph>, or parcel to use + short-circuit local reads. <note> If you use Cloudera Manager, you can + enable short-circuit reads through a checkbox in the user interface + and that setting takes effect for Impala as well. </note> + </p> + <p> + <b>To configure DataNodes for short-circuit reads:</b> + </p> + <ol id="ol_qlq_wyv_ls"> + <li id="copy_config_files"> Copy the client + <codeph>core-site.xml</codeph> and <codeph>hdfs-site.xml</codeph> + configuration files from the Hadoop configuration directory to the + Impala configuration directory. The default Impala configuration + location is <codeph>/etc/impala/conf</codeph>. 
</li> + <li> + <indexterm audience="Cloudera" + >dfs.client.read.shortcircuit</indexterm> + <indexterm audience="Cloudera">dfs.domain.socket.path</indexterm> + <indexterm audience="Cloudera" + >dfs.client.file-block-storage-locations.timeout.millis</indexterm> + On all Impala nodes, configure the following properties in <!-- Exact timing is unclear, since we say farther down to copy /etc/hadoop/conf/hdfs-site.xml to /etc/impala/conf. + Which wouldn't work if we already modified the Impala version of the file here. Not to mention that this + doesn't take the CM interface into account, where these /etc files might not exist in those locations. --> + <!-- <codeph>/etc/impala/conf/hdfs-site.xml</codeph> as shown: --> + Impala's copy of <codeph>hdfs-site.xml</codeph> as shown: <codeblock><property> + <name>dfs.client.read.shortcircuit</name> + <value>true</value> +</property> + +<property> + <name>dfs.domain.socket.path</name> + <value>/var/run/hdfs-sockets/dn</value> +</property> + +<property> + <name>dfs.client.file-block-storage-locations.timeout.millis</name> + <value>10000</value> +</property></codeblock> + <!-- Former socket.path value: <value>/var/run/hadoop-hdfs/dn._PORT</value> --> + <!-- + <note> + The text <codeph>_PORT</codeph> appears just as shown; you do not need to + substitute a number. + </note> +--> + </li> + <li> + <p> If <codeph>/var/run/hadoop-hdfs/</codeph> is group-writable, make + sure its group is <codeph>root</codeph>. </p> + <note> If you are also going to enable block location tracking, you + can skip copying configuration files and restarting DataNodes and go + straight to <xref href="#config_performance/block_location_tracking" + >Mandatory: Block Location Tracking</xref>. + Configuring short-circuit reads and block location tracking requires + the same process of copying files and restarting services, so you + can complete that process once when you have completed all + configuration changes. 
Whether you copy files and restart services + now or while configuring block location tracking, short-circuit + reads are not enabled until you complete those final steps. </note> + </li> + <li id="restart_all_datanodes"> After applying these changes, restart + all DataNodes. </li> + </ol> + </section> + + <section id="block_location_tracking"> + + <title>Mandatory: Block Location Tracking</title> + + <p> + Enabling block location metadata allows Impala to know which disks data blocks are located on, allowing + better utilization of the underlying disks. Impala will not start unless this setting is enabled. + </p> + + <p> + <b>To enable block location tracking:</b> + </p> + + <ol> + <li> + For each DataNode, add the following to the <codeph>hdfs-site.xml</codeph> file: +<codeblock><property> + <name>dfs.datanode.hdfs-blocks-metadata.enabled</name> + <value>true</value> +</property> </codeblock> + </li> + + <li conref="#config_performance/copy_config_files"/> + + <li conref="#config_performance/restart_all_datanodes"/> + </ol> + </section> + + <section id="native_checksumming"> + + <title>Optional: Native Checksumming</title> + + <p> + Enabling native checksumming causes Impala to use an optimized native library for computing checksums, if + that library is available. + </p> + + <p id="p_29"> + <b>To enable native checksumming:</b> + </p> + + <p> + If you installed CDH from packages, the native checksumming library is installed and set up correctly. In + such a case, no additional steps are required. Conversely, if you installed by other means, such as with + tarballs, native checksumming may not be available due to missing shared objects. Finding the message + "<codeph>Unable to load native-hadoop library for your platform... using builtin-java classes where + applicable</codeph>" in the Impala logs indicates that native checksumming may be unavailable. 
To enable native + checksumming, you must build and install <codeph>libhadoop.so</codeph> (the + <!-- Another instance of stale link. --> + <!-- <xref href="http://hadoop.apache.org/docs/r0.19.1/native_libraries.html" scope="external" format="html">Hadoop Native Library</xref>). --> + Hadoop Native Library). + </p> + </section> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_connecting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_connecting.xml b/docs/topics/impala_connecting.xml new file mode 100644 index 0000000..354e698 --- /dev/null +++ b/docs/topics/impala_connecting.xml @@ -0,0 +1,202 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="connecting"> + + <title>Connecting to impalad through impala-shell</title> + <titlealts audience="PDF"><navtitle>Connecting to impalad</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="impala-shell"/> + <data name="Category" value="Network"/> + <data name="Category" value="DataNode"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + +<!-- +TK: This would be a good theme for a tutorial topic. +Lots of nuances to illustrate through sample code. +--> + + <p> + Within an <cmdname>impala-shell</cmdname> session, you can only issue queries while connected to an instance + of the <cmdname>impalad</cmdname> daemon. You can specify the connection information: + <ul> + <li> + Through command-line options when you run the <cmdname>impala-shell</cmdname> command. + </li> + <li> + Through a configuration file that is read when you run the <cmdname>impala-shell</cmdname> command. 
+ </li> + <li> + During an <cmdname>impala-shell</cmdname> session, by issuing a <codeph>CONNECT</codeph> command. + </li> + </ul> + See <xref href="impala_shell_options.xml"/> for the command-line and configuration file options you can use. + </p> + + <p> + You can connect to any DataNode where an instance of <cmdname>impalad</cmdname> is running, + and that host coordinates the execution of all queries sent to it. + </p> + + <p> + For simplicity during development, you might always connect to the same host, perhaps running <cmdname>impala-shell</cmdname> on + the same host as <cmdname>impalad</cmdname> and specifying the hostname as <codeph>localhost</codeph>. + </p> + + <p> + In a production environment, you might enable load balancing, in which you connect to a specific host/port combination + but queries are forwarded to arbitrary hosts. This technique spreads the overhead of acting as the coordinator + node among all the DataNodes in the cluster. See <xref href="impala_proxy.xml"/> for details. + </p> + + <p> + <b>To connect the Impala shell during shell startup:</b> + </p> + + <ol> + <li> + Locate the hostname of a DataNode within the cluster that is running an instance of the + <cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something + other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the + port number also. + </li> + + <li> + Use the <codeph>-i</codeph> option to the + <cmdname>impala-shell</cmdname> interpreter to specify the connection information for + that instance of <cmdname>impalad</cmdname>: +<codeblock> +# When you are logged into the same machine running impalad. +# The prompt will reflect the current hostname. +$ impala-shell + +# When you are logged into the same machine running impalad. +# The prompt will reflect the hostname 'localhost'. +$ impala-shell -i localhost + +# When you are logged onto a different host, perhaps a client machine +# outside the Hadoop cluster. 
+$ impala-shell -i <varname>some.other.hostname</varname> + +# When you are logged onto a different host, and impalad is listening +# on a non-default port. Perhaps a load balancer is forwarding requests +# to a different host/port combination behind the scenes. +$ impala-shell -i <varname>some.other.hostname</varname>:<varname>port_number</varname> +</codeblock> + </li> + </ol> + + <p> + <b>To connect the Impala shell after shell startup:</b> + </p> + + <ol> + <li> + Start the Impala shell with no connection: +<codeblock>$ impala-shell</codeblock> + <p> + You should see a prompt like the following: + </p> +<codeblock>Welcome to the Impala shell. Press TAB twice to see a list of available commands. + +Copyright (c) <varname>year</varname> Cloudera, Inc. All rights reserved. + +<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/> +[Not connected] > </codeblock> + </li> + + <li> + Locate the hostname of a DataNode within the cluster that is running an instance of the + <cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something + other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the + port number also. + </li> + + <li> + Use the <codeph>connect</codeph> command to connect to an Impala instance. Enter a command of the form: +<codeblock>[Not connected] > connect <varname>impalad-host</varname> +[<varname>impalad-host</varname>:21000] ></codeblock> + <note> + Replace <varname>impalad-host</varname> with the hostname you have configured for any DataNode running + Impala in your environment. The changed prompt indicates a successful connection. + </note> + </li> + </ol> + + <p> + <b>To start <cmdname>impala-shell</cmdname> in a specific database:</b> + </p> + + <p> + You can use all the same connection options as in previous examples. + For simplicity, these examples assume that you are logged into one of + the DataNodes that is running the <cmdname>impalad</cmdname> daemon. 
+ </p> + + <ol> + <li> + Find the name of the database containing the relevant tables, views, and so + on that you want to operate on. + </li> + + <li> + Use the <codeph>-d</codeph> option to the + <cmdname>impala-shell</cmdname> interpreter to connect and immediately + switch to the specified database, without the need for a <codeph>USE</codeph> + statement or fully qualified names: +<codeblock> +# Subsequent queries with unqualified names operate on +# tables, views, and so on inside the database named 'staging'. +$ impala-shell -i localhost -d staging + +# It is common during development, ETL, benchmarking, and so on +# to have different databases containing the same table names +# but with different contents or layouts. +$ impala-shell -i localhost -d parquet_snappy_compression +$ impala-shell -i localhost -d parquet_gzip_compression +</codeblock> + </li> + </ol> + + <p> + <b>To run one or several statements in non-interactive mode:</b> + </p> + + <p> + You can use all the same connection options as in previous examples. + For simplicity, these examples assume that you are logged into one of + the DataNodes that is running the <cmdname>impalad</cmdname> daemon. + </p> + + <ol> + <li> + Construct a statement, or a file containing a sequence of statements, + that you want to run in an automated way, without typing or copying + and pasting each time. + </li> + + <li> + Invoke <cmdname>impala-shell</cmdname> with the <codeph>-q</codeph> option to run a single statement, or + the <codeph>-f</codeph> option to run a sequence of statements from a file. + The <cmdname>impala-shell</cmdname> command returns immediately, without going into + the interactive interpreter. +<codeblock> +# A utility command that you might run while developing shell scripts +# to manipulate HDFS files. 
+$ impala-shell -i localhost -d database_of_interest -q 'show tables' + +# A sequence of CREATE TABLE, CREATE VIEW, and similar DDL statements +# can go into a file to make the setup process repeatable. +$ impala-shell -i localhost -d database_of_interest -f recreate_tables.sql +</codeblock> + </li> + </ol> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_conversion_functions.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_conversion_functions.xml b/docs/topics/impala_conversion_functions.xml new file mode 100644 index 0000000..36f5ec4 --- /dev/null +++ b/docs/topics/impala_conversion_functions.xml @@ -0,0 +1,758 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="conversion_functions"> + + <title>Impala Type Conversion Functions</title> + <titlealts audience="PDF"><navtitle>Type Conversion Functions</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + Conversion functions are usually used in combination with other functions, to explicitly pass the expected + data types. Impala has strict rules regarding data types for function parameters. For example, Impala does + not automatically convert a <codeph>DOUBLE</codeph> value to <codeph>FLOAT</codeph>, a + <codeph>BIGINT</codeph> value to <codeph>INT</codeph>, or other conversion where precision could be lost or + overflow could occur. 
Also, for reporting or dealing with loosely defined schemas in big data contexts, + you might frequently need to convert values to or from the <codeph>STRING</codeph> type. + </p> + + <note> + Although in CDH 5.5.0 the <codeph>SHOW FUNCTIONS</codeph> output for + database <codeph>_IMPALA_BUILTINS</codeph> contains some function signatures + matching the pattern <codeph>castto*</codeph>, these functions are not intended + for public use and are expected to be hidden in the future. + </note> + + <p> + <b>Function reference:</b> + </p> + + <p> + Impala supports the following type conversion functions: + </p> + +<dl> + +<dlentry id="cast"> +<dt> +<codeph>cast(<varname>expr</varname> AS <varname>type</varname>)</codeph> +</dt> + +<dd> +<indexterm audience="Cloudera">cast() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to any other type. +If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. +<p><b>Usage notes:</b> +Use <codeph>CAST</codeph> when passing a column value or literal to a function that +expects a parameter with a different type. +Frequently used in SQL operations such as <codeph>CREATE TABLE AS SELECT</codeph> +and <codeph>INSERT ... VALUES</codeph> to ensure that values from various sources +are of the appropriate type for the destination columns. +Where practical, do a one-time <codeph>CAST()</codeph> operation during the ingestion process +to convert each column to the appropriate type, rather than using many <codeph>CAST()</codeph> +operations in each query; doing type conversions for each row during each query can be expensive +for tables with millions or billions of rows. 
+</p> + <p conref="../shared/impala_common.xml#common/timezone_conversion_caveat"/> + +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<codeblock>select concat('Here are the first ',10,' results.'); -- Fails +select concat('Here are the first ',cast(10 as string),' results.'); -- Succeeds +</codeblock> +<p> +The following example starts with a text table where every column has a type of <codeph>STRING</codeph>, +which might be how you ingest data of unknown schema until you can verify the cleanliness of the underlying values. +Then it uses <codeph>CAST()</codeph> to create a new Parquet table with the same data, but using specific +numeric data types for the columns with numeric data. Using numeric types of appropriate sizes can result in +substantial space savings on disk and in memory, and better query performance, +compared to using strings or larger-than-necessary numeric types. +</p> +<codeblock>create table t1 (name string, x string, y string, z string); + +create table t2 stored as parquet +as select + name, + cast(x as bigint) x, + cast(y as timestamp) y, + cast(z as smallint) z +from t1; + +describe t2; ++------+-----------+---------+ +| name | type | comment | ++------+-----------+---------+ +| name | string | | +| x | bigint | | +| y | timestamp | | +| z | smallint | | ++------+-----------+---------+ +</codeblock> +<p conref="../shared/impala_common.xml#common/related_info"/> +<p> +<!-- TK: Can you cast to or from MAP, ARRAY, STRUCT? 
--> + For details of casts from each kind of data type, see the description of + the appropriate type: + <xref href="impala_tinyint.xml#tinyint"/>, + <xref href="impala_smallint.xml#smallint"/>, + <xref href="impala_int.xml#int"/>, + <xref href="impala_bigint.xml#bigint"/>, + <xref href="impala_float.xml#float"/>, + <xref href="impala_double.xml#double"/>, + <xref href="impala_decimal.xml#decimal"/>, + <xref href="impala_string.xml#string"/>, + <xref href="impala_char.xml#char"/>, + <xref href="impala_varchar.xml#varchar"/>, + <xref href="impala_timestamp.xml#timestamp"/>, + <xref href="impala_boolean.xml#boolean"/> +</p> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttobigint" audience="Cloudera"> +<dt> +<codeph>casttobigint(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttobigint() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>BIGINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>bigint</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>create table small_types (x tinyint, y smallint, z int); + +create table big_types as + select casttobigint(x) as x, casttobigint(y) as y, casttobigint(z) as z + from small_types; + +describe big_types; ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | bigint | | +| y | bigint | | +| z | bigint | | ++------+--------+---------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttoboolean" audience="Cloudera"> +<dt> +<codeph>casttoboolean(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttoboolean() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>BOOLEAN</codeph>. +Numeric values of 0 evaluate to <codeph>false</codeph>, and non-zero values evaluate to <codeph>true</codeph>. +If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. +In particular, <codeph>STRING</codeph> values (even <codeph>'1'</codeph>, <codeph>'0'</codeph>, <codeph>'true'</codeph> +or <codeph>'false'</codeph>) always return <codeph>NULL</codeph> when converted to <codeph>BOOLEAN</codeph>. 
+<p><b>Return type:</b> <codeph>boolean</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>select casttoboolean(0); ++------------------+ +| casttoboolean(0) | ++------------------+ +| false | ++------------------+ + +select casttoboolean(1); ++------------------+ +| casttoboolean(1) | ++------------------+ +| true | ++------------------+ + +select casttoboolean(99); ++-------------------+ +| casttoboolean(99) | ++-------------------+ +| true | ++-------------------+ + +select casttoboolean(0.0); ++--------------------+ +| casttoboolean(0.0) | ++--------------------+ +| false | ++--------------------+ + +select casttoboolean(0.5); ++--------------------+ +| casttoboolean(0.5) | ++--------------------+ +| true | ++--------------------+ + +select casttoboolean(''); ++-------------------+ +| casttoboolean('') | ++-------------------+ +| NULL | ++-------------------+ + +select casttoboolean('yes'); ++----------------------+ +| casttoboolean('yes') | ++----------------------+ +| NULL | ++----------------------+ + +select casttoboolean('0'); ++--------------------+ +| casttoboolean('0') | ++--------------------+ +| NULL | ++--------------------+ + +select casttoboolean('true'); ++-----------------------+ +| casttoboolean('true') | ++-----------------------+ +| NULL | ++-----------------------+ + +select casttoboolean('false'); ++------------------------+ +| casttoboolean('false') | ++------------------------+ +| NULL | ++------------------------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttochar" audience="Cloudera"> +<dt> +<codeph>casttochar(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttochar() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to 
<codeph>CHAR</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. +<p><b>Return type:</b> <codeph>char</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>create table char_types as select casttochar('hello world') as c1, casttochar('xyz') as c2, casttochar('x') as c3; ++-------------------+ +| summary | ++-------------------+ +| Inserted 1 row(s) | ++-------------------+ + +describe char_types; ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| c1 | string | | +| c2 | string | | +| c3 | string | | ++------+--------+---------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttodecimal" audience="Cloudera"> +<dt> +<codeph>casttodecimal(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttodecimal() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>DECIMAL</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>decimal</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>select casttodecimal(5.4); ++--------------------+ +| casttodecimal(5.4) | ++--------------------+ +| 5.4 | ++--------------------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttodouble" audience="Cloudera"> +<dt> +<codeph>casttodouble(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttodouble() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>DOUBLE</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. +<p><b>Return type:</b> <codeph>double</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>select casttodouble(5); ++-----------------+ +| casttodouble(5) | ++-----------------+ +| 5 | ++-----------------+ + +select casttodouble('3.141'); ++-----------------------+ +| casttodouble('3.141') | ++-----------------------+ +| 3.141 | ++-----------------------+ + +select casttodouble(1e6); ++--------------------+ +| casttodouble(1e+6) | ++--------------------+ +| 1000000 | ++--------------------+ + +select casttodouble(true); ++--------------------+ +| casttodouble(true) | ++--------------------+ +| 1 | ++--------------------+ + +select casttodouble(now()); ++---------------------+ +| casttodouble(now()) | ++---------------------+ +| 1447622306.031178 | ++---------------------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" 
id="casttofloat" audience="Cloudera"> +<dt> +<codeph>casttofloat(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttofloat() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>FLOAT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. +<p><b>Return type:</b> <codeph>float</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>select casttofloat(5); ++----------------+ +| casttofloat(5) | ++----------------+ +| 5 | ++----------------+ + +select casttofloat('3.141'); ++----------------------+ +| casttofloat('3.141') | ++----------------------+ +| 3.141000032424927 | ++----------------------+ + +select casttofloat(1e6); ++-------------------+ +| casttofloat(1e+6) | ++-------------------+ +| 1000000 | ++-------------------+ + +select casttofloat(true); ++-------------------+ +| casttofloat(true) | ++-------------------+ +| 1 | ++-------------------+ + +select casttofloat(now()); ++--------------------+ +| casttofloat(now()) | ++--------------------+ +| 1447622400 | ++--------------------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttoint" audience="Cloudera"> +<dt> +<codeph>casttoint(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttoint() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>INT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>int</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>select casttoint(5.4); ++----------------+ +| casttoint(5.4) | ++----------------+ +| 5 | ++----------------+ + +select casttoint(true); ++-----------------+ +| casttoint(true) | ++-----------------+ +| 1 | ++-----------------+ + +select casttoint(now()); ++------------------+ +| casttoint(now()) | ++------------------+ +| 1447622487 | ++------------------+ + +select casttoint('3.141'); ++--------------------+ +| casttoint('3.141') | ++--------------------+ +| NULL | ++--------------------+ + +select casttoint('3'); ++----------------+ +| casttoint('3') | ++----------------+ +| 3 | ++----------------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttosmallint" audience="Cloudera"> +<dt> +<codeph>casttosmallint(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttosmallint() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>SMALLINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>smallint</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>create table big_types (x bigint, y int, z smallint); + +create table small_types as + select casttosmallint(x) as x, casttosmallint(y) as y, casttosmallint(z) as z + from big_types; + +describe small_types; ++------+----------+---------+ +| name | type | comment | ++------+----------+---------+ +| x | smallint | | +| y | smallint | | +| z | smallint | | ++------+----------+---------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttostring" audience="Cloudera"> +<dt> +<codeph>casttostring(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttostring() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>STRING</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>string</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>create table numeric_types (x int, y bigint, z tinyint); + +create table string_types as + select casttostring(x) as x, casttostring(y) as y, casttostring(z) as z + from numeric_types; + +describe string_types; ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | string | | +| y | string | | +| z | string | | ++------+--------+---------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttotimestamp" audience="Cloudera"> +<dt> +<codeph>casttotimestamp(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttotimestamp() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>TIMESTAMP</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>timestamp</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>select casttotimestamp(1000); ++-----------------------+ +| casttotimestamp(1000) | ++-----------------------+ +| 1970-01-01 00:16:40 | ++-----------------------+ + +select casttotimestamp(1000.0); ++-------------------------+ +| casttotimestamp(1000.0) | ++-------------------------+ +| 1970-01-01 00:16:40 | ++-------------------------+ + +select casttotimestamp('1000'); ++-------------------------+ +| casttotimestamp('1000') | ++-------------------------+ +| NULL | ++-------------------------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttotinyint" audience="Cloudera"> +<dt> +<codeph>casttotinyint(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttotinyint() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>TINYINT</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>tinyint</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>create table big_types (x bigint, y int, z smallint); + +create table tiny_types as + select casttotinyint(x) as x, casttotinyint(y) as y, casttotinyint(z) as z + from big_types; + +describe tiny_types; ++------+---------+---------+ +| name | type | comment | ++------+---------+---------+ +| x | tinyint | | +| y | tinyint | | +| z | tinyint | | ++------+---------+---------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="casttovarchar" audience="Cloudera"> +<dt> +<codeph>casttovarchar(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">casttovarchar() function</indexterm> +<b>Purpose:</b> Converts the value of an expression to <codeph>VARCHAR</codeph>. If the expression value is of a type that cannot be converted to the target type, the result is <codeph>NULL</codeph>. 
+<p><b>Return type:</b> <codeph>varchar</codeph></p> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_usage"/> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p conref="../shared/impala_common.xml#common/cast_convenience_fn_example"/> +<codeblock>select casttovarchar('abcd'); ++-----------------------+ +| casttovarchar('abcd') | ++-----------------------+ +| abcd | ++-----------------------+ + +select casttovarchar(999); ++--------------------+ +| casttovarchar(999) | ++--------------------+ +| 999 | ++--------------------+ + +select casttovarchar(999.5); ++----------------------+ +| casttovarchar(999.5) | ++----------------------+ +| 999.5 | ++----------------------+ + +select casttovarchar(now()); ++-------------------------------+ +| casttovarchar(now()) | ++-------------------------------+ +| 2015-11-15 21:26:13.528073000 | ++-------------------------------+ + +select casttovarchar(true); ++---------------------+ +| casttovarchar(true) | ++---------------------+ +| 1 | ++---------------------+ +</codeblock> +</dd> +</dlentry> + +<dlentry rev="2.3.0" id="typeof"> +<dt> +<codeph>typeof(type value)</codeph> +</dt> +<dd> +<indexterm audience="Cloudera">typeof() function</indexterm> +<b>Purpose:</b> Returns the name of the data type corresponding to an expression. For types with +extra attributes, such as length for <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph>, +or precision and scale for <codeph>DECIMAL</codeph>, includes the full specification of the type. +<!-- To do: How about for columns of complex types? Or fields within complex types? --> +<p><b>Return type:</b> <codeph>string</codeph></p> +<p><b>Usage notes:</b> Typically used in interactive exploration of a schema, or in application code that programmatically generates schema definitions such as <codeph>CREATE TABLE</codeph> statements. 
+For example, previously, to understand the type of an expression such as +<codeph>col1 / col2</codeph> or <codeph>concat(col1, col2, col3)</codeph>, +you might have created a dummy table with a single row, using syntax such as <codeph>CREATE TABLE foo AS SELECT 5 / 3.0</codeph>, +and then done a <codeph>DESCRIBE</codeph> to see the type of the resulting column. +Or you might have done a <codeph>CREATE TABLE AS SELECT</codeph> operation to create a table and +copy data into it, only learning the types of the columns by doing a <codeph>DESCRIBE</codeph> afterward. +This technique is especially useful for arithmetic expressions involving <codeph>DECIMAL</codeph> types, +because the precision and scale of the result are typically different from those of the operands. +</p> +<p conref="../shared/impala_common.xml#common/added_in_230"/> +<p conref="../shared/impala_common.xml#common/example_blurb"/> +<p> +These examples show how to check the type of a simple literal or function value. +Notice how adding even tiny integers together changes the data type of the result to +avoid overflow, and how the results of arithmetic operations on <codeph>DECIMAL</codeph> values +have specific precision and scale attributes. 
+</p> +<codeblock>select typeof(2); ++-----------+ +| typeof(2) | ++-----------+ +| TINYINT | ++-----------+ + +select typeof(2+2); ++---------------+ +| typeof(2 + 2) | ++---------------+ +| SMALLINT | ++---------------+ + +select typeof('xyz'); ++---------------+ +| typeof('xyz') | ++---------------+ +| STRING | ++---------------+ + +select typeof(now()); ++---------------+ +| typeof(now()) | ++---------------+ +| TIMESTAMP | ++---------------+ + +select typeof(5.3 / 2.1); ++-------------------+ +| typeof(5.3 / 2.1) | ++-------------------+ +| DECIMAL(6,4) | ++-------------------+ + +select typeof(5.30001 / 2342.1); ++--------------------------+ +| typeof(5.30001 / 2342.1) | ++--------------------------+ +| DECIMAL(13,11) | ++--------------------------+ + +select typeof(typeof(2+2)); ++-----------------------+ +| typeof(typeof(2 + 2)) | ++-----------------------+ +| STRING | ++-----------------------+ +</codeblock> + +<p> +This example shows how even if you do not have a record of the type of a column, +for example because the type was changed by <codeph>ALTER TABLE</codeph> after the +original <codeph>CREATE TABLE</codeph>, you can still find out the type in a +more compact form than examining the full <codeph>DESCRIBE</codeph> output. +Remember to use <codeph>LIMIT 1</codeph> in such cases, to avoid an identical +result value for every row in the table. +</p> +<codeblock>create table typeof_example (a int, b tinyint, c smallint, d bigint); + +/* Empty result set if there is no data in the table. */ +select typeof(a) from typeof_example; + +/* OK, now we have some data but the type of column A is being changed. */ +insert into typeof_example values (1, 2, 3, 4); +alter table typeof_example change a a bigint; + +/* We can always find out the current type of that column without doing a full DESCRIBE. 
*/ +select typeof(a) from typeof_example limit 1; ++-----------+ +| typeof(a) | ++-----------+ +| BIGINT | ++-----------+ +</codeblock> +<p> +This example shows how you might programmatically generate a <codeph>CREATE TABLE</codeph> statement +with the appropriate column definitions to hold the result values of arbitrary expressions. +The <codeph>typeof()</codeph> function lets you construct a detailed <codeph>CREATE TABLE</codeph> statement +without actually creating the table, as opposed to <codeph>CREATE TABLE AS SELECT</codeph> operations +where you create the destination table but only learn the column data types afterward through <codeph>DESCRIBE</codeph>. +</p> +<codeblock>describe typeof_example; ++------+----------+---------+ +| name | type | comment | ++------+----------+---------+ +| a | bigint | | +| b | tinyint | | +| c | smallint | | +| d | bigint | | ++------+----------+---------+ + +/* An ETL or business intelligence tool might create variations on a table with different file formats, + different sets of columns, and so on. TYPEOF() lets an application introspect the types of the original columns. 
*/ +select concat('create table derived_table (a ', typeof(a), ', b ', typeof(b), ', c ', + typeof(c), ', d ', typeof(d), ') stored as parquet;') + as 'create table statement' +from typeof_example limit 1; ++-------------------------------------------------------------------------------------------+ +| create table statement | ++-------------------------------------------------------------------------------------------+ +| create table derived_table (a BIGINT, b TINYINT, c SMALLINT, d BIGINT) stored as parquet; | ++-------------------------------------------------------------------------------------------+ +</codeblock> +</dd> +</dlentry> + +</dl> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_count.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_count.xml b/docs/topics/impala_count.xml new file mode 100644 index 0000000..4f9a697 --- /dev/null +++ b/docs/topics/impala_count.xml @@ -0,0 +1,236 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="count"> + + <title>COUNT Function</title> + <titlealts audience="PDF"><navtitle>COUNT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Analytic Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">count() function</indexterm> + An aggregate function that returns the number of rows, or the number of non-<codeph>NULL</codeph> rows. 
+ </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>COUNT([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]</codeblock> + + <p> + Depending on the argument, <codeph>COUNT()</codeph> considers rows that meet certain conditions: + </p> + + <ul> + <li> + The notation <codeph>COUNT(*)</codeph> includes <codeph>NULL</codeph> values in the total. + </li> + + <li> + The notation <codeph>COUNT(<varname>column_name</varname>)</codeph> only considers rows where the column + contains a non-<codeph>NULL</codeph> value. + </li> + + <li> + You can also combine <codeph>COUNT</codeph> with the <codeph>DISTINCT</codeph> operator to eliminate + duplicates before counting, and to count the combinations of values across multiple columns. + </li> + </ul> + + <p> + When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of + grouping values. + </p> + + <p> + <b>Return type:</b> <codeph>BIGINT</codeph> + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p conref="../shared/impala_common.xml#common/partition_key_optimization"/> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>-- How many rows total are in the table, regardless of NULL values? +select count(*) from t1; +-- How many rows are in the table with non-NULL values for a column? +select count(c1) from t1; +-- Count the rows that meet certain conditions. +-- Again, * includes NULLs, so COUNT(*) might be greater than COUNT(col). +select count(*) from t1 where x > 10; +select count(c1) from t1 where x > 10; +-- Can also be used in combination with DISTINCT and/or GROUP BY. 
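+-- Because COUNT(*) includes NULLs and COUNT(col) does not, subtracting the two
+-- yields the number of NULL values in a column.
+-- (Illustrative sketch; assumes c1 contains some NULL values.)
+select count(*) - count(c1) as c1_nulls from t1;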
+-- Combine COUNT and DISTINCT to find the number of unique values. +-- Must use column names rather than * with COUNT(DISTINCT ...) syntax. +-- Rows with NULL values are not counted. +select count(distinct c1) from t1; +-- Rows with a NULL value in _either_ column are not counted. +select count(distinct c1, c2) from t1; +-- Return more than one result. +select month, year, count(distinct visitor_id) from web_stats group by month, year; +</codeblock> + + <p rev="2.0.0"> + The following examples show how to use <codeph>COUNT()</codeph> in an analytic context. They use a table + containing integers from 1 to 10. Notice how the <codeph>COUNT()</codeph> is reported for each input value, as + opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set. +<codeblock>select x, property, count(x) over (partition by property) as count from int_t where property in ('odd','even'); ++----+----------+-------+ +| x | property | count | ++----+----------+-------+ +| 2 | even | 5 | +| 4 | even | 5 | +| 6 | even | 5 | +| 8 | even | 5 | +| 10 | even | 5 | +| 1 | odd | 5 | +| 3 | odd | 5 | +| 5 | odd | 5 | +| 7 | odd | 5 | +| 9 | odd | 5 | ++----+----------+-------+ +</codeblock> + +Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving +set of rows (the <q>window</q>). The following examples use <codeph>COUNT()</codeph> in an analytic context +(that is, with an <codeph>OVER()</codeph> clause) to produce a running count of all the even values, +then a running count of all the odd values. 
The basic <codeph>ORDER BY x</codeph> clause implicitly
+activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
+which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
+therefore all of these examples produce the same results:
+<codeblock>select x, property,
+  count(x) over (partition by property <b>order by x</b>) as 'cumulative count'
+  from int_t where property in ('odd','even');
++----+----------+------------------+
+| x  | property | cumulative count |
++----+----------+------------------+
+| 2  | even     | 1                |
+| 4  | even     | 2                |
+| 6  | even     | 3                |
+| 8  | even     | 4                |
+| 10 | even     | 5                |
+| 1  | odd      | 1                |
+| 3  | odd      | 2                |
+| 5  | odd      | 3                |
+| 7  | odd      | 4                |
+| 9  | odd      | 5                |
++----+----------+------------------+
+
+select x, property,
+  count(x) over
+    (
+      partition by property
+      <b>order by x</b>
+      <b>range between unbounded preceding and current row</b>
+    ) as 'cumulative count'
+from int_t where property in ('odd','even');
++----+----------+------------------+
+| x  | property | cumulative count |
++----+----------+------------------+
+| 2  | even     | 1                |
+| 4  | even     | 2                |
+| 6  | even     | 3                |
+| 8  | even     | 4                |
+| 10 | even     | 5                |
+| 1  | odd      | 1                |
+| 3  | odd      | 2                |
+| 5  | odd      | 3                |
+| 7  | odd      | 4                |
+| 9  | odd      | 5                |
++----+----------+------------------+
+
+select x, property,
+  count(x) over
+    (
+      partition by property
+      <b>order by x</b>
+      <b>rows between unbounded preceding and current row</b>
+    ) as 'cumulative count'
+  from int_t where property in ('odd','even');
++----+----------+------------------+
+| x  | property | cumulative count |
++----+----------+------------------+
+| 2  | even     | 1                |
+| 4  | even     | 2                |
+| 6  | even     | 3                |
+| 8  | even     | 4                |
+| 10 | even     | 5                |
+| 1  | odd      | 1                |
+| 3  | odd      | 2                |
+| 5  | odd      | 3                |
+| 7  | odd      | 4                |
+| 9  | odd      | 5                |
++----+----------+------------------+
+</codeblock>
+
+The following examples show how to construct a moving window, with a running count taking
into account 1 row before
+and 1 row after the current row, within the same partition (all the even values or all the odd values).
+Therefore, the count is consistently 3 for rows in the middle of each partition, and 2 for
+rows at either end of a partition, where the window has no preceding or no following row.
+Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of
+moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph>
+clause:
+<codeblock>select x, property,
+  count(x) over
+    (
+      partition by property
+      <b>order by x</b>
+      <b>rows between 1 preceding and 1 following</b>
+    ) as 'moving total'
+  from int_t where property in ('odd','even');
++----+----------+--------------+
+| x  | property | moving total |
++----+----------+--------------+
+| 2  | even     | 2            |
+| 4  | even     | 3            |
+| 6  | even     | 3            |
+| 8  | even     | 3            |
+| 10 | even     | 2            |
+| 1  | odd      | 2            |
+| 3  | odd      | 3            |
+| 5  | odd      | 3            |
+| 7  | odd      | 3            |
+| 9  | odd      | 2            |
++----+----------+--------------+
+
+-- Doesn't work because of the syntax restriction on the RANGE clause.
+select x, property,
+  count(x) over
+    (
+      partition by property
+      <b>order by x</b>
+      <b>range between 1 preceding and 1 following</b>
+    ) as 'moving total'
+from int_t where property in ('odd','even');
+ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.
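+
+-- Sketch of a RANGE window that the restriction does allow: both bounds UNBOUNDED,
+-- or one bound UNBOUNDED and the other CURRENT ROW, as in the earlier cumulative examples.
+select x, property,
+  count(x) over
+    (
+      partition by property
+      order by x
+      range between unbounded preceding and current row
+    ) as 'running count'
+from int_t where property in ('odd','even');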
+</codeblock> + </p> + + <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#analytic_functions"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_create_data_source.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_create_data_source.xml b/docs/topics/impala_create_data_source.xml new file mode 100644 index 0000000..a93d6bc --- /dev/null +++ b/docs/topics/impala_create_data_source.xml @@ -0,0 +1,35 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept audience="Cloudera" rev="1.4.0" id="create_data_source"> + + <title>CREATE DATA SOURCE Statement</title> + <titlealts audience="PDF"><navtitle>CREATE DATA SOURCE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="DDL"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">CREATE DATA SOURCE statement</indexterm> + </p> + + <p conref="../shared/impala_common.xml#common/ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_create_database.xml ---------------------------------------------------------------------- diff --git 
a/docs/topics/impala_create_database.xml b/docs/topics/impala_create_database.xml new file mode 100644 index 0000000..de556bc --- /dev/null +++ b/docs/topics/impala_create_database.xml @@ -0,0 +1,137 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="create_database"> + + <title>CREATE DATABASE Statement</title> + <titlealts audience="PDF"><navtitle>CREATE DATABASE</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Databases"/> + <data name="Category" value="Schemas"/> + <data name="Category" value="DDL"/> + <data name="Category" value="S3"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">CREATE DATABASE statement</indexterm> + Creates a new database. + </p> + + <p> + In Impala, a database is both: + </p> + + <ul> + <li> + A logical construct for grouping together related tables, views, and functions within their own namespace. + You might use a separate database for each application, set of related tables, or round of experimentation. + </li> + + <li> + A physical construct represented by a directory tree in HDFS. Tables (internal tables), partitions, and + data files are all located under this directory. You can perform HDFS-level operations such as backing it up and measuring space usage, + or remove it with a <codeph>DROP DATABASE</codeph> statement. 
+ </li>
+ </ul>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <varname>database_name</varname>
+  [COMMENT '<varname>database_comment</varname>']
+  [LOCATION <varname>hdfs_path</varname>];</codeblock>
+
+ <p conref="../shared/impala_common.xml#common/ddl_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ A database is physically represented as a directory in HDFS, with a filename extension <codeph>.db</codeph>,
+ under the main Impala data directory. If the associated HDFS directory does not exist, it is created for you.
+ All databases and their associated directories are top-level objects, with no physical or logical nesting.
+ </p>
+
+ <p>
+ After creating a database, to make it the current database within an <cmdname>impala-shell</cmdname> session,
+ use the <codeph>USE</codeph> statement. You can refer to tables in the current database without prepending
+ any qualifier to their names.
+ </p>
+
+ <p>
+ When you first connect to Impala through <cmdname>impala-shell</cmdname>, the database you start in (before
+ issuing any <codeph>CREATE DATABASE</codeph> or <codeph>USE</codeph> statements) is named
+ <codeph>default</codeph>.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/builtins_db"/>
+
+ <p>
+ After creating a database, your <cmdname>impala-shell</cmdname> session or another
+ <cmdname>impala-shell</cmdname> session connected to the same node can immediately access that database. To access
+ the database through the Impala daemon on a different node, issue the <codeph>INVALIDATE METADATA</codeph>
+ statement first while connected to that other node.
+ </p>
+
+ <p>
+ Setting the <codeph>LOCATION</codeph> attribute for a new database is a way to work with sets of files in an
+ HDFS directory structure outside the default Impala data directory, as opposed to setting the
+ <codeph>LOCATION</codeph> attribute for each individual table.
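+ For example, the following sketch (the database name and HDFS path are hypothetical)
+ keeps a database, and therefore all of its tables, under a directory of your choosing:
+ </p>
+
+<codeblock>CREATE DATABASE external_data LOCATION '/user/shared/external_data_db';
+USE external_data;
+-- Data files for this table go under the LOCATION directory,
+-- not the default Impala data directory.
+CREATE TABLE t1 (x INT);</codeblock>
+
+ <p>
+ Adjust the path to match your own HDFS directory structure.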
+ </p> + + <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> + + <p conref="../shared/impala_common.xml#common/hive_blurb"/> + + <p> + When you create a database in Impala, the database can also be used by Hive. + When you create a database in Hive, issue an <codeph>INVALIDATE METADATA</codeph> + statement in Impala to make Impala permanently aware of the new database. + </p> + + <p> + The <codeph>SHOW DATABASES</codeph> statement lists all databases, or the databases whose name + matches a wildcard pattern. <ph rev="2.5.0">In <keyword keyref="impala25_full"/> and higher, the + <codeph>SHOW DATABASES</codeph> output includes a second column that displays the associated + comment, if any, for each database.</ph> + </p> + + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + + <p rev="2.6.0 CDH-39913 IMPALA-1878"> + To specify that any tables created within a database reside on the Amazon S3 system, + you can include an <codeph>s3a://</codeph> prefix on the <codeph>LOCATION</codeph> + attribute. In <keyword keyref="impala26_full"/> and higher, Impala automatically creates any + required folders as the databases, tables, and partitions are created, and removes + them when they are dropped. + </p> + + <p conref="../shared/impala_common.xml#common/s3_ddl"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have write + permission for the parent HDFS directory under which the database + is located. 
+ </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <codeblock conref="../shared/impala_common.xml#common/create_drop_db_example"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_databases.xml#databases"/>, <xref href="impala_drop_database.xml#drop_database"/>, + <xref href="impala_use.xml#use"/>, <xref href="impala_show.xml#show_databases"/>, + <xref href="impala_tables.xml#tables"/> + </p> + </conbody> +</concept>
