http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_live_summary.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_live_summary.html b/docs/build/html/topics/impala_live_summary.html new file mode 100644 index 0000000..cb41693 --- /dev/null +++ b/docs/build/html/topics/impala_live_summary.html @@ -0,0 +1,177 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="live_summary"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>LIVE_SUMMARY Query Option (Impala 2.3 or higher only)</title></head><body id="live_summary"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">LIVE_SUMMARY Query Option (<span class="keyword">Impala 2.3</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + For queries submitted through the <span class="keyword cmdname">impala-shell</span> command, + displays the same output as the <code class="ph codeph">SUMMARY</code> command, + with the measurements updated in real time as the query progresses. + When the query finishes, the final <code class="ph codeph">SUMMARY</code> output remains + visible in the <span class="keyword cmdname">impala-shell</span> console output. + </p> + + <p class="p"> + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>; + any other value interpreted as <code class="ph codeph">false</code> + </p> + <p class="p"> + <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code> statement) + </p> + + <p class="p"> + <strong class="ph b">Command-line equivalent:</strong> + </p> + <p class="p"> + You can enable this query option within <span class="keyword cmdname">impala-shell</span> + by starting the shell with the <code class="ph codeph">--live_summary</code> + command-line option. + You can still turn this setting off and on again within the shell through the + <code class="ph codeph">SET</code> command. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + <p class="p"> + The live summary output can be useful for evaluating long-running queries, + to evaluate which phase of execution takes up the most time, or if some hosts + take much longer than others for certain operations, dragging overall performance down. + By making the information available in real time, this feature lets you decide what + action to take even before you cancel a query that is taking much longer than normal. + </p> + <p class="p"> + For example, you might see the HDFS scan phase taking a long time, and therefore revisit + performance-related aspects of your schema design such as constructing a partitioned table, + switching to the Parquet file format, running the <code class="ph codeph">COMPUTE STATS</code> statement + for the table, and so on. + Or you might see a wide variation between the average and maximum times for all hosts to + perform some phase of the query, and therefore investigate if one particular host + needed more memory or was experiencing a network problem. + </p> + <p class="p"> + The output from this query option is printed to standard error. The output is only displayed in interactive mode, + that is, not when the <code class="ph codeph">-q</code> or <code class="ph codeph">-f</code> options are used. + </p> + <p class="p"> + For a simple and concise way of tracking the progress of an interactive query, see + <a class="xref" href="impala_live_progress.html#live_progress">LIVE_PROGRESS Query Option (Impala 2.3 or higher only)</a>. + </p> + + <p class="p"> + <strong class="ph b">Restrictions:</strong> + </p> + <p class="p"> + The <code class="ph codeph">LIVE_PROGRESS</code> and <code class="ph codeph">LIVE_SUMMARY</code> query options + currently do not produce any output during <code class="ph codeph">COMPUTE STATS</code> operations. + </p> + <div class="p"> + Because the <code class="ph codeph">LIVE_PROGRESS</code> and <code class="ph codeph">LIVE_SUMMARY</code> query options + are available only within the <span class="keyword cmdname">impala-shell</span> interpreter: + <ul class="ul"> + <li class="li"> + <p class="p"> + You cannot change these query options through the SQL <code class="ph codeph">SET</code> + statement using the JDBC or ODBC interfaces. The <code class="ph codeph">SET</code> + command in <span class="keyword cmdname">impala-shell</span> recognizes these names as + shell-only options. + </p> + </li> + <li class="li"> + <p class="p"> + Be careful when using <span class="keyword cmdname">impala-shell</span> on a pre-<span class="keyword">Impala 2.3</span> + system to connect to a system running <span class="keyword">Impala 2.3</span> or higher. + The older <span class="keyword cmdname">impala-shell</span> does not recognize these + query option names. Upgrade <span class="keyword cmdname">impala-shell</span> on the + systems where you intend to use these query options. + </p> + </li> + <li class="li"> + <p class="p"> + Likewise, the <span class="keyword cmdname">impala-shell</span> command relies on + some information only available in <span class="keyword">Impala 2.3</span> and higher + to prepare live progress reports and query summaries. The + <code class="ph codeph">LIVE_PROGRESS</code> and <code class="ph codeph">LIVE_SUMMARY</code> + query options have no effect when <span class="keyword cmdname">impala-shell</span> connects + to a cluster running an older version of Impala. + </p> + </li> + </ul> + </div> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.3.0</span> + </p> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <p class="p"> + The following example shows a series of <code class="ph codeph">LIVE_SUMMARY</code> reports that + are displayed during the course of a query, showing how the numbers increase to + show the progress of different phases of the distributed query. When you do the same + in <span class="keyword cmdname">impala-shell</span>, only a single report is displayed at any one time, + with each update overwriting the previous numbers. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > set live_summary=true; +LIVE_SUMMARY set to true +[localhost:21000] > select count(*) from customer t1 cross join customer t2; ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | 10.00 MB | | +| 02:NESTED LOOP JOIN | 0 | 0ns | 0ns | 0 | 22.50B | 0 B | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 0 | 0ns | 0ns | 0 | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 0 | 0ns | 0ns | 0 | 150.00K | 0 B | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 17.62s | 17.62s | 81.14M | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 247.53ms | 247.53ms | 1.02K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ +| 06:AGGREGATE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | FINALIZE | +| 05:EXCHANGE | 0 | 0ns | 0ns | 0 | 1 | 0 B | -1 B | UNPARTITIONED | +| 03:AGGREGATE | 1 | 0ns | 0ns | 0 | 1 | 20.00 KB | 10.00 MB | | +| 02:NESTED LOOP JOIN | 1 | 61.85s | 61.85s | 283.43M | 22.50B | 3.23 MB | 0 B | CROSS JOIN, BROADCAST | +| |--04:EXCHANGE | 1 | 26.29ms | 26.29ms | 150.00K | 150.00K | 0 B | 0 B | BROADCAST | +| | 01:SCAN HDFS | 1 | 503.57ms | 503.57ms | 150.00K | 150.00K | 24.09 MB | 64.00 MB | tpch.customer t2 | +| 00:SCAN HDFS | 1 | 247.59ms | 247.59ms | 2.05K | 150.00K | 24.39 MB | 64.00 MB | tpch.customer t1 | ++---------------------+--------+----------+----------+---------+------------+----------+---------------+-----------------------+ + +</code></pre> + + + + + <p class="p"> + To see how the <code class="ph codeph">LIVE_PROGRESS</code> and <code class="ph codeph">LIVE_SUMMARY</code> query options + work in real time, see <a class="xref" href="https://asciinema.org/a/1rv7qippo0fe7h5k1b6k4nexk" target="_blank">this animated demo</a>. + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_load_data.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_load_data.html b/docs/build/html/topics/impala_load_data.html new file mode 100644 index 0000000..e49408b --- /dev/null +++ b/docs/build/html/topics/impala_load_data.html @@ -0,0 +1,306 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref_sql.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="load_data"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>LOAD DATA Statement</title></head><body id="load_data"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">LOAD DATA Statement</h1> + + + + <div class="body conbody"> + + <p class="p"> + + The <code class="ph codeph">LOAD DATA</code> statement streamlines the ETL process for an internal Impala table by moving a + data file or all the data files in a directory from an HDFS location into the Impala data directory for that + table. + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + +<pre class="pre codeblock"><code>LOAD DATA INPATH '<var class="keyword varname">hdfs_file_or_directory_path</var>' [OVERWRITE] INTO TABLE <var class="keyword varname">tablename</var> + [PARTITION (<var class="keyword varname">partcol1</var>=<var class="keyword varname">val1</var>, <var class="keyword varname">partcol2</var>=<var class="keyword varname">val2</var> ...)]</code></pre> + + <p class="p"> + When the <code class="ph codeph">LOAD DATA</code> statement operates on a partitioned table, + it always operates on one partition at a time. Specify the <code class="ph codeph">PARTITION</code> clauses + and list all the partition key columns, with a constant value specified for each. + </p> + + <p class="p"> + <strong class="ph b">Statement type:</strong> DML (but still affected by + <a class="xref" href="../shared/../topics/impala_sync_ddl.html#sync_ddl">SYNC_DDL</a> query option) + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <ul class="ul"> + <li class="li"> + The loaded data files are moved, not copied, into the Impala data directory. + </li> + + <li class="li"> + You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the + files inside that directory. You cannot specify any sort of wildcard to take only some of the files from a + directory. When loading a directory full of data files, keep all the data files at the top level, with no + nested directories underneath. + </li> + + <li class="li"> + Currently, the Impala <code class="ph codeph">LOAD DATA</code> statement only imports files from HDFS, not from the local + filesystem. It does not support the <code class="ph codeph">LOCAL</code> keyword of the Hive <code class="ph codeph">LOAD DATA</code> + statement. You must specify a path, not an <code class="ph codeph">hdfs://</code> URI. + </li> + + <li class="li"> + In the interest of speed, only limited error checking is done. If the loaded files have the wrong file + format, different columns than the destination table, or other kind of mismatch, Impala does not raise any + error for the <code class="ph codeph">LOAD DATA</code> statement. Querying the table afterward could produce a runtime + error or unexpected results. Currently, the only checking the <code class="ph codeph">LOAD DATA</code> statement does is + to avoid mixing together uncompressed and LZO-compressed text files in the same table. + </li> + + <li class="li"> + When you specify an HDFS directory name as the <code class="ph codeph">LOAD DATA</code> argument, any hidden files in + that directory (files whose names start with a <code class="ph codeph">.</code>) are not moved to the Impala data + directory. + </li> + + <li class="li"> + The operation fails if the source directory contains any non-hidden directories. + Prior to <span class="keyword">Impala 2.5</span> if the source directory contained any subdirectory, even a hidden one such as + <span class="ph filepath">_impala_insert_staging</span>, the <code class="ph codeph">LOAD DATA</code> statement would fail. + In <span class="keyword">Impala 2.5</span> and higher, <code class="ph codeph">LOAD DATA</code> ignores hidden subdirectories in the + source directory, and only fails if any of the subdirectories are non-hidden. + </li> + + <li class="li"> + The loaded data files retain their original names in the new location, unless a name conflicts with an + existing data file, in which case the name of the new file is modified slightly to be unique. (The + name-mangling is a slight difference from the Hive <code class="ph codeph">LOAD DATA</code> statement, which replaces + identically named files.) + </li> + + <li class="li"> + By providing an easy way to transport files from known locations in HDFS into the Impala data directory + structure, the <code class="ph codeph">LOAD DATA</code> statement lets you avoid memorizing the locations and layout of + HDFS directory tree containing the Impala databases and tables. (For a quick way to check the location of + the data files for an Impala table, issue the statement <code class="ph codeph">DESCRIBE FORMATTED + <var class="keyword varname">table_name</var></code>.) + </li> + + <li class="li"> + The <code class="ph codeph">PARTITION</code> clause is especially convenient for ingesting new data for a partitioned + table. As you receive new data for a time period, geographic region, or other division that corresponds to + one or more partitioning columns, you can load that data straight into the appropriate Impala data + directory, which might be nested several levels down if the table is partitioned by multiple columns. When + the table is partitioned, you must specify constant values for all the partitioning columns. + </li> + </ul> + + <p class="p"> + <strong class="ph b">Complex type considerations:</strong> + </p> + + <p class="p"> + Because Impala currently cannot create Parquet data files containing complex types + (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>), the + <code class="ph codeph">LOAD DATA</code> statement is especially important when working with + tables containing complex type columns. You create the Parquet data files outside + Impala, then use either <code class="ph codeph">LOAD DATA</code>, an external table, or HDFS-level + file operations followed by <code class="ph codeph">REFRESH</code> to associate the data files with + the corresponding table. + See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details about using complex types. + </p> + + <p class="p"> + If you connect to different Impala nodes within an <span class="keyword cmdname">impala-shell</span> session for + load-balancing purposes, you can enable the <code class="ph codeph">SYNC_DDL</code> query option to make each DDL + statement wait before returning, until the new or changed metadata has been received by all the Impala + nodes. See <a class="xref" href="../shared/../topics/impala_sync_ddl.html#sync_ddl">SYNC_DDL Query Option</a> for details. + </p> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + After adding or replacing data in a table used in performance-critical queries, issue a <code class="ph codeph">COMPUTE + STATS</code> statement to make sure all statistics are up-to-date. Consider updating statistics for a + table after any <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, or <code class="ph codeph">CREATE TABLE AS + SELECT</code> statement in Impala, or after loading data through Hive and doing a <code class="ph codeph">REFRESH + <var class="keyword varname">table_name</var></code> in Impala. This technique is especially important for tables that + are very large, used in join queries, or both. + </div> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <p class="p"> + First, we use a trivial Python script to write different numbers of strings (one per line) into files stored + in the <code class="ph codeph">doc_demo</code> HDFS user account. (Substitute the path for your own HDFS user account when + doing <span class="keyword cmdname">hdfs dfs</span> operations like these.) + </p> + +<pre class="pre codeblock"><code>$ random_strings.py 1000 | hdfs dfs -put - /user/doc_demo/thousand_strings.txt +$ random_strings.py 100 | hdfs dfs -put - /user/doc_demo/hundred_strings.txt +$ random_strings.py 10 | hdfs dfs -put - /user/doc_demo/ten_strings.txt</code></pre> + + <p class="p"> + Next, we create a table and load an initial set of data into it. Remember, unless you specify a + <code class="ph codeph">STORED AS</code> clause, Impala tables default to <code class="ph codeph">TEXTFILE</code> format with Ctrl-A (hex + 01) as the field delimiter. This example uses a single-column table, so the delimiter is not significant. For + large-scale ETL jobs, you would typically use binary format data files such as Parquet or Avro, and load them + into Impala tables that use the corresponding file format. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table t1 (s string); +[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1; +Query finished, fetching results ... ++----------------------------------------------------------+ +| summary | ++----------------------------------------------------------+ +| Loaded 1 file(s). Total files in destination location: 1 | ++----------------------------------------------------------+ +Returned 1 row(s) in 0.61s +[kilo2-202-961.cs1cloud.internal:21000] > select count(*) from t1; +Query finished, fetching results ... ++------+ +| _c0 | ++------+ +| 1000 | ++------+ +Returned 1 row(s) in 0.67s +[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1; +ERROR: AnalysisException: INPATH location '/user/doc_demo/thousand_strings.txt' does not exist. </code></pre> + + <p class="p"> + As indicated by the message at the end of the previous example, the data file was moved from its original + location. The following example illustrates how the data file was moved into the Impala data directory for + the destination table, keeping its original filename: + </p> + +<pre class="pre codeblock"><code>$ hdfs dfs -ls /user/hive/warehouse/load_data_testing.db/t1 +Found 1 items +-rw-r--r-- 1 doc_demo doc_demo 13926 2013-06-26 15:40 /user/hive/warehouse/load_data_testing.db/t1/thousand_strings.txt</code></pre> + + <p class="p"> + The following example demonstrates the difference between the <code class="ph codeph">INTO TABLE</code> and + <code class="ph codeph">OVERWRITE TABLE</code> clauses. The table already contains 1000 rows. After issuing the + <code class="ph codeph">LOAD DATA</code> statement with the <code class="ph codeph">INTO TABLE</code> clause, the table contains 100 more + rows, for a total of 1100. After issuing the <code class="ph codeph">LOAD DATA</code> statement with the <code class="ph codeph">OVERWRITE + INTO TABLE</code> clause, the former contents are gone, and now the table only contains the 10 rows from + the just-loaded data file. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > load data inpath '/user/doc_demo/hundred_strings.txt' into table t1; +Query finished, fetching results ... ++----------------------------------------------------------+ +| summary | ++----------------------------------------------------------+ +| Loaded 1 file(s). Total files in destination location: 2 | ++----------------------------------------------------------+ +Returned 1 row(s) in 0.24s +[localhost:21000] > select count(*) from t1; +Query finished, fetching results ... ++------+ +| _c0 | ++------+ +| 1100 | ++------+ +Returned 1 row(s) in 0.55s +[localhost:21000] > load data inpath '/user/doc_demo/ten_strings.txt' overwrite into table t1; +Query finished, fetching results ... ++----------------------------------------------------------+ +| summary | ++----------------------------------------------------------+ +| Loaded 1 file(s). Total files in destination location: 1 | ++----------------------------------------------------------+ +Returned 1 row(s) in 0.26s +[localhost:21000] > select count(*) from t1; +Query finished, fetching results ... ++-----+ +| _c0 | ++-----+ +| 10 | ++-----+ +Returned 1 row(s) in 0.62s</code></pre> + + <p class="p"> + <strong class="ph b">Amazon S3 considerations:</strong> + </p> + <p class="p"> + In <span class="keyword">Impala 2.6</span> and higher, the Impala DML statements (<code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, + and <code class="ph codeph">CREATE TABLE AS SELECT</code>) can write data into a table or partition that resides in the + Amazon Simple Storage Service (S3). + The syntax of the DML statements is the same as for any other tables, because the S3 location for tables and + partitions is specified by an <code class="ph codeph">s3a://</code> prefix in the + <code class="ph codeph">LOCATION</code> attribute of + <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statements. + If you bring data into S3 using the normal S3 transfer mechanisms instead of Impala DML statements, + issue a <code class="ph codeph">REFRESH</code> statement for the table before using Impala to query the S3 data. + </p> + <p class="p"> + Because of differences between S3 and traditional filesystems, DML operations + for S3 tables can take longer than for tables on HDFS. For example, both the + <code class="ph codeph">LOAD DATA</code> statement and the final stage of the <code class="ph codeph">INSERT</code> + and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements involve moving files from one directory + to another. (In the case of <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code>, + the files are moved from a temporary staging directory to the final destination directory.) + Because S3 does not support a <span class="q">"rename"</span> operation for existing objects, in these cases Impala + actually copies the data files from one location to another and then removes the original files. + In <span class="keyword">Impala 2.6</span>, the <code class="ph codeph">S3_SKIP_INSERT_STAGING</code> query option provides a way + to speed up <code class="ph codeph">INSERT</code> statements for S3 tables and partitions, with the tradeoff + that a problem during statement execution could leave data in an inconsistent state. + It does not apply to <code class="ph codeph">INSERT OVERWRITE</code> or <code class="ph codeph">LOAD DATA</code> statements. + See <a class="xref" href="../shared/../topics/impala_s3_skip_insert_staging.html#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only)</a> for details. + </p> + <p class="p">See <a class="xref" href="impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a> for details about reading and writing S3 data with Impala.</p> + + <p class="p"> + <strong class="ph b">Cancellation:</strong> Cannot be cancelled. + </p> + + <p class="p"> + <strong class="ph b">HDFS permissions:</strong> + </p> + <p class="p"> + The user ID that the <span class="keyword cmdname">impalad</span> daemon runs under, + typically the <code class="ph codeph">impala</code> user, must have read and write + permissions for the files in the source directory, and write + permission for the destination directory. + </p> + + <p class="p"> + <strong class="ph b">Kudu considerations:</strong> + </p> + <p class="p"> + The <code class="ph codeph">LOAD DATA</code> statement cannot be used with Kudu tables. + </p> + + <p class="p"> + <strong class="ph b">HBase considerations:</strong> + </p> + <p class="p"> + The <code class="ph codeph">LOAD DATA</code> statement cannot be used with HBase tables. + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + The <code class="ph codeph">LOAD DATA</code> statement is an alternative to the + <code class="ph codeph"><a class="xref" href="impala_insert.html#insert">INSERT</a></code> statement. + Use <code class="ph codeph">LOAD DATA</code> + when you have the data files in HDFS but outside of any Impala table. + </p> + <p class="p"> + The <code class="ph codeph">LOAD DATA</code> statement is also an alternative + to the <code class="ph codeph">CREATE EXTERNAL TABLE</code> statement. Use + <code class="ph codeph">LOAD DATA</code> when it is appropriate to move the + data files under Impala control rather than querying them + from their original location. See <a class="xref" href="impala_tables.html#external_tables">External Tables</a> + for information about working with external tables. + </p> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref_sql.html">Impala SQL Statements</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_logging.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_logging.html b/docs/build/html/topics/impala_logging.html new file mode 100644 index 0000000..049f370 --- /dev/null +++ b/docs/build/html/topics/impala_logging.html @@ -0,0 +1,416 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impa la 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="logging"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala Logging</title></head><body id="logging"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Using Impala Logging</h1> + + + + <div class="body conbody"> + + <p class="p"> + The Impala logs record information about: + </p> + + <ul class="ul"> + <li class="li"> + Any errors Impala encountered. If Impala experienced a serious error during startup, you must diagnose and + troubleshoot that problem before you can do anything further with Impala. + </li> + + <li class="li"> + How Impala is configured. + </li> + + <li class="li"> + Jobs Impala has completed. + </li> + </ul> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Formerly, the logs contained the query profile for each query, showing low-level details of how the work is + distributed among nodes and how intermediate and final results are transmitted across the network. To save + space, those query profiles are now stored in zlib-compressed files in + <span class="ph filepath">/var/log/impala/profiles</span>. You can access them through the Impala web user interface. + For example, at <code class="ph codeph">http://<var class="keyword varname">impalad-node-hostname</var>:25000/queries</code>, each query + is followed by a <code class="ph codeph">Profile</code> link leading to a page showing extensive analytical data for the + query execution. + </p> + + <p class="p"> + The auditing feature introduced in Impala 1.1.1 produces a separate set of audit log files when + enabled. See <a class="xref" href="impala_auditing.html#auditing">Auditing Impala Operations</a> for details. + </p> + + <p class="p"> + The lineage feature introduced in Impala 2.2.0 produces a separate lineage log file when + enabled. See <a class="xref" href="impala_lineage.html#lineage">Viewing Lineage Information for Impala Data</a> for details. + </p> + </div> + + <p class="p toc inpage"></p> + + </div> + + <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="logging__logs_details"> + + <h2 class="title topictitle2" id="ariaid-title2">Locations and Names of Impala Log Files</h2> + + <div class="body conbody"> + + <ul class="ul"> + <li class="li"> + By default, the log files are under the directory <span class="ph filepath">/var/log/impala</span>. + To change log file locations, modify the defaults file described in + <a class="xref" href="impala_processes.html#processes">Starting Impala</a>. + </li> + + <li class="li"> + The significant files for the <code class="ph codeph">impalad</code> process are <span class="ph filepath">impalad.INFO</span>, + <span class="ph filepath">impalad.WARNING</span>, and <span class="ph filepath">impalad.ERROR</span>. You might also see a file + <span class="ph filepath">impalad.FATAL</span>, although this is only present in rare conditions. + </li> + + <li class="li"> + The significant files for the <code class="ph codeph">statestored</code> process are + <span class="ph filepath">statestored.INFO</span>, <span class="ph filepath">statestored.WARNING</span>, and + <span class="ph filepath">statestored.ERROR</span>. You might also see a file <span class="ph filepath">statestored.FATAL</span>, + although this is only present in rare conditions. + </li> + + <li class="li"> + The significant files for the <code class="ph codeph">catalogd</code> process are <span class="ph filepath">catalogd.INFO</span>, + <span class="ph filepath">catalogd.WARNING</span>, and <span class="ph filepath">catalogd.ERROR</span>. You might also see a file + <span class="ph filepath">catalogd.FATAL</span>, although this is only present in rare conditions. + </li> + + <li class="li"> + Examine the <code class="ph codeph">.INFO</code> files to see configuration settings for the processes. + </li> + + <li class="li"> + Examine the <code class="ph codeph">.WARNING</code> files to see all kinds of problem information, including such + things as suboptimal settings and also serious runtime errors. + </li> + + <li class="li"> + Examine the <code class="ph codeph">.ERROR</code> and/or <code class="ph codeph">.FATAL</code> files to see only the most serious + errors, if the processes crash, or queries fail to complete. These messages are also in the + <code class="ph codeph">.WARNING</code> file. + </li> + + <li class="li"> + A new set of log files is produced each time the associated daemon is restarted. These log files have + long names including a timestamp. The <code class="ph codeph">.INFO</code>, <code class="ph codeph">.WARNING</code>, and + <code class="ph codeph">.ERROR</code> files are physically represented as symbolic links to the latest applicable log + files. + </li> + + <li class="li"> + The init script for the <code class="ph codeph">impala-server</code> service also produces a consolidated log file + <code class="ph codeph">/var/logs/impalad/impala-server.log</code>, with all the same information as the + corresponding<code class="ph codeph">.INFO</code>, <code class="ph codeph">.WARNING</code>, and <code class="ph codeph">.ERROR</code> files. + </li> + + <li class="li"> + The init script for the <code class="ph codeph">impala-state-store</code> service also produces a consolidated log file + <code class="ph codeph">/var/logs/impalad/impala-state-store.log</code>, with all the same information as the + corresponding<code class="ph codeph">.INFO</code>, <code class="ph codeph">.WARNING</code>, and <code class="ph codeph">.ERROR</code> files. + </li> + </ul> + + <p class="p"> + Impala stores information using the <code class="ph codeph">glog_v</code> logging system. You will see some messages + referring to C++ file names. Logging is affected by: + </p> + + <ul class="ul"> + <li class="li"> + The <code class="ph codeph">GLOG_v</code> environment variable specifies which types of messages are logged. See + <a class="xref" href="#log_levels">Setting Logging Levels</a> for details. + </li> + + <li class="li"> + The <code class="ph codeph">-logbuflevel</code> startup flag for the <span class="keyword cmdname">impalad</span> daemon specifies how + often the log information is written to disk. The default is 0, meaning that the log is immediately + flushed to disk when Impala outputs an important messages such as a warning or an error, but less + important messages such as informational ones are buffered in memory rather than being flushed to disk + immediately. + </li> + </ul> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="logging__logs_managing"> + + <h2 class="title topictitle2" id="ariaid-title3">Managing Impala Logs</h2> + + + <div class="body conbody"> + + <p class="p"> + Review Impala log files on each host, when you have traced an issue back to a specific system. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="logging__logs_rotate"> + + <h2 class="title topictitle2" id="ariaid-title4">Rotating Impala Logs</h2> + + + <div class="body conbody"> + + <p class="p"> + Impala periodically switches the physical files representing the current log files, after which it is safe + to remove the old files if they are no longer needed. + </p> + + <p class="p"> + Impala can automatically remove older unneeded log files, a feature known as <dfn class="term">log rotation</dfn>. + + </p> + + <p class="p"> + In Impala 2.2 and higher, the <code class="ph codeph">-max_log_files</code> configuration option specifies how many log + files to keep at each severity level. You can specify an appropriate setting for each Impala-related daemon + (<span class="keyword cmdname">impalad</span>, <span class="keyword cmdname">statestored</span>, and <span class="keyword cmdname">catalogd</span>). The default + value is 10, meaning that Impala preserves the latest 10 log files for each severity level + (<code class="ph codeph">INFO</code>, <code class="ph codeph">WARNING</code>, <code class="ph codeph">ERROR</code>, and <code class="ph codeph">FATAL</code>). + Impala checks to see if any old logs need to be removed based on the interval specified in the + <code class="ph codeph">logbufsecs</code> setting, every 5 seconds by default. + </p> + + + + <p class="p"> + A value of 0 preserves all log files, in which case you would set up set up manual log rotation using your + Linux tool or technique of choice. A value of 1 preserves only the very latest log file. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="logging__logs_debug"> + + <h2 class="title topictitle2" id="ariaid-title5">Reviewing Impala Logs</h2> + + <div class="body conbody"> + + <p class="p"> + By default, the Impala log is stored at <code class="ph codeph">/var/logs/impalad/</code>. The most comprehensive log, + showing informational, warning, and error messages, is in the file name <span class="ph filepath">impalad.INFO</span>. + View log file contents by using the web interface or by examining the contents of the log file. (When you + examine the logs through the file system, you can troubleshoot problems by reading the + <span class="ph filepath">impalad.WARNING</span> and/or <span class="ph filepath">impalad.ERROR</span> files, which contain the + subsets of messages indicating potential problems.) + </p> + + <p class="p"> + On a machine named <code class="ph codeph">impala.example.com</code> with default settings, you could view the Impala + logs on that machine by using a browser to access <code class="ph codeph">http://impala.example.com:25000/logs</code>. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + The web interface limits the amount of logging information displayed. To view every log entry, access the + log files directly through the file system. + </p> + </div> + + <p class="p"> + You can view the contents of the <code class="ph codeph">impalad.INFO</code> log file in the file system. With the + default configuration settings, the start of the log file appears as follows: + </p> + +<pre class="pre codeblock"><code>[user@example impalad]$ pwd +/var/log/impalad +[user@example impalad]$ more impalad.INFO +Log file created at: 2013/01/07 08:42:12 +Running on machine: impala.example.com +Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg +I0107 08:42:12.292155 14876 daemon.cc:34] impalad version 0.4 RELEASE (build 9d7fadca0461ab40b9e9df8cdb47107ec6b27cff) +Built on Fri, 21 Dec 2012 12:55:19 PST +I0107 08:42:12.292484 14876 daemon.cc:35] Using hostname: impala.example.com +I0107 08:42:12.292706 14876 logging.cc:76] Flags (see also /varz are on debug webserver): +--dump_ir=false +--module_output= +--be_port=22000 +--classpath= +--hostname=impala.example.com</code></pre> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + The preceding example shows only a small part of the log file. Impala log files are often several megabytes + in size. + </div> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="logging__log_format"> + + <h2 class="title topictitle2" id="ariaid-title6">Understanding Impala Log Contents</h2> + + <div class="body conbody"> + + <p class="p"> + The logs store information about Impala startup options. This information appears once for each time Impala + is started and may include: + </p> + + <ul class="ul"> + <li class="li"> + Machine name. + </li> + + <li class="li"> + Impala version number. + </li> + + <li class="li"> + Flags used to start Impala. + </li> + + <li class="li"> + CPU information. + </li> + + <li class="li"> + The number of available disks. + </li> + </ul> + + <p class="p"> + There is information about each job Impala has run. Because each Impala job creates an additional set of + data about queries, the amount of job specific data may be very large. Logs may contained detailed + information on jobs. These detailed log entries may include: + </p> + + <ul class="ul"> + <li class="li"> + The composition of the query. + </li> + + <li class="li"> + The degree of data locality. + </li> + + <li class="li"> + Statistics on data throughput and response times. + </li> + </ul> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="logging__log_levels"> + + <h2 class="title topictitle2" id="ariaid-title7">Setting Logging Levels</h2> + + <div class="body conbody"> + + <p class="p"> + Impala uses the GLOG system, which supports three logging levels. You can adjust logging levels + by exporting variable settings. To change logging settings manually, use a command + similar to the following on each node before starting <code class="ph codeph">impalad</code>: + </p> + +<pre class="pre codeblock"><code>export GLOG_v=1</code></pre> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + For performance reasons, do not enable the most verbose logging level of 3 unless there is + no other alternative for troubleshooting. + </div> + + <p class="p"> + For more information on how to configure GLOG, including how to set variable logging levels for different + system components, see + <a class="xref" href="http://google-glog.googlecode.com/svn/trunk/doc/glog.html" target="_blank">How + To Use Google Logging Library (glog)</a>. + </p> + + <section class="section" id="log_levels__loglevels_details"><h3 class="title sectiontitle">Understanding What is Logged at Different Logging Levels</h3> + + + + <p class="p"> + As logging levels increase, the categories of information logged are cumulative. For example, GLOG_v=2 + records everything GLOG_v=1 records, as well as additional information. + </p> + + <p class="p"> + Increasing logging levels imposes performance overhead and increases log size. Where practical, use + GLOG_v=1 for most cases: this level has minimal performance impact but still captures useful + troubleshooting information. + </p> + + <p class="p"> + Additional information logged at each level is as follows: + </p> + + <ul class="ul"> + <li class="li"> + GLOG_v=1 - The default level. Logs information about each connection and query that is initiated to an + <code class="ph codeph">impalad</code> instance, including runtime profiles. + </li> + + <li class="li"> + GLOG_v=2 - Everything from the previous level plus information for each RPC initiated. This level also + records query execution progress information, including details on each file that is read. + </li> + + <li class="li"> + GLOG_v=3 - Everything from the previous level plus logging of every row that is read. This level is + only applicable for the most serious troubleshooting and tuning scenarios, because it can produce + exceptionally large and detailed log files, potentially leading to its own set of performance and + capacity problems. + </li> + </ul> + + </section> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title8" id="logging__redaction"> + + <h2 class="title topictitle2" id="ariaid-title8">Redacting Sensitive Information from Impala Log Files</h2> + + + <div class="body conbody"> + + <p class="p"> + + <dfn class="term">Log redaction</dfn> is a security feature that prevents sensitive information from being displayed in + locations used by administrators for monitoring and troubleshooting, such as log files and the Impala debug web + user interface. You configure regular expressions that match sensitive types of information processed by your + system, such as credit card numbers or tax IDs, and literals matching these patterns are obfuscated wherever + they would normally be recorded in log files or displayed in administration or debugging user interfaces. + </p> + + <p class="p"> + In a security context, the log redaction feature is complementary to the Sentry authorization framework. + Sentry prevents unauthorized users from being able to directly access table data. Redaction prevents + administrators or support personnel from seeing the smaller amounts of sensitive or personally identifying + information (PII) that might appear in queries issued by those authorized users. + </p> + + <p class="p"> + See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details about how to enable this feature and set + up the regular expressions to detect and redact sensitive information within SQL statement text. + </p> + + </div> + + </article> + +</article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_map.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_map.html b/docs/build/html/topics/impala_map.html new file mode 100644 index 0000000..31d91f0 --- /dev/null +++ b/docs/build/html/topics/impala_map.html @@ -0,0 +1,331 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_datatypes.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="map"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>MAP Complex Type (Impala 2.3 or higher only)</title></head><body id="map"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">MAP Complex Type (<span class="keyword">Impala 2.3</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + A complex data type representing an arbitrary set of key-value pairs. + The key part is a scalar type, while the value part can be a scalar or + another complex type (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, + or <code class="ph codeph">MAP</code>). + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + +<pre class="pre codeblock"><code><var class="keyword varname">column_name</var> MAP < <var class="keyword varname">primitive_type</var>, <var class="keyword varname">type</var> > + +type ::= <var class="keyword varname">primitive_type</var> | <var class="keyword varname">complex_type</var> +</code></pre> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + Because complex types are often used in combination, + for example an <code class="ph codeph">ARRAY</code> of <code class="ph codeph">STRUCT</code> + elements, if you are unfamiliar with the Impala complex types, + start with <a class="xref" href="../shared/../topics/impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for + background information and usage examples. + </p> + + <p class="p"> + The <code class="ph codeph">MAP</code> complex data type represents a set of key-value pairs. + Each element of the map is indexed by a primitive type such as <code class="ph codeph">BIGINT</code> or + <code class="ph codeph">STRING</code>, letting you define sequences that are not continuous or categories with arbitrary names. + You might find it convenient for modelling data produced in other languages, such as a + Python dictionary or Java HashMap, where a single scalar value serves as the lookup key. + </p> + + <p class="p"> + In a big data context, the keys in a map column might represent a numeric sequence of events during a + manufacturing process, or <code class="ph codeph">TIMESTAMP</code> values corresponding to sensor observations. + The map itself is inherently unordered, so you choose whether to make the key values significant + (such as a recorded <code class="ph codeph">TIMESTAMP</code>) or synthetic (such as a random global universal ID). + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Behind the scenes, the <code class="ph codeph">MAP</code> type is implemented in a similar way as the + <code class="ph codeph">ARRAY</code> type. Impala does not enforce any uniqueness constraint on the + <code class="ph codeph">KEY</code> values, and the <code class="ph codeph">KEY</code> values are processed by + looping through the elements of the <code class="ph codeph">MAP</code> rather than by a constant-time lookup. + Therefore, this type is primarily for ease of understanding when importing data and + algorithms from non-SQL contexts, rather than optimizing the performance of key lookups. + </div> + + <p class="p"> + You can pass a multi-part qualified name to <code class="ph codeph">DESCRIBE</code> + to specify an <code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, or <code class="ph codeph">MAP</code> + column and visualize its structure as if it were a table. + For example, if table <code class="ph codeph">T1</code> contains an <code class="ph codeph">ARRAY</code> column + <code class="ph codeph">A1</code>, you could issue the statement <code class="ph codeph">DESCRIBE t1.a1</code>. + If table <code class="ph codeph">T1</code> contained a <code class="ph codeph">STRUCT</code> column <code class="ph codeph">S1</code>, + and a field <code class="ph codeph">F1</code> within the <code class="ph codeph">STRUCT</code> was a <code class="ph codeph">MAP</code>, + you could issue the statement <code class="ph codeph">DESCRIBE t1.s1.f1</code>. + An <code class="ph codeph">ARRAY</code> is shown as a two-column table, with + <code class="ph codeph">ITEM</code> and <code class="ph codeph">POS</code> columns. + A <code class="ph codeph">STRUCT</code> is shown as a table with each field + representing a column in the table. + A <code class="ph codeph">MAP</code> is shown as a two-column table, with + <code class="ph codeph">KEY</code> and <code class="ph codeph">VALUE</code> columns. + </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.3.0</span> + </p> + + <p class="p"> + <strong class="ph b">Restrictions:</strong> + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + Columns with this data type can only be used in tables or partitions with the Parquet file format. + </p> + </li> + <li class="li"> + <p class="p"> + Columns with this data type cannot be used as partition key columns in a partitioned table. + </p> + </li> + <li class="li"> + <p class="p"> + The <code class="ph codeph">COMPUTE STATS</code> statement does not produce any statistics for columns of this data type. + </p> + </li> + <li class="li"> + <p class="p" id="map__d6e2889"> + The maximum length of the column definition for any complex type, including declarations for any nested types, + is 4000 characters. + </p> + </li> + <li class="li"> + <p class="p"> + See <a class="xref" href="../shared/../topics/impala_complex_types.html#complex_types_limits">Limitations and Restrictions for Complex Types</a> for a full list of limitations + and associated guidelines about complex type columns. + </p> + </li> + </ul> + + <p class="p"> + <strong class="ph b">Kudu considerations:</strong> + </p> + <p class="p"> + Currently, the data types <code class="ph codeph">DECIMAL</code>, <code class="ph codeph">TIMESTAMP</code>, <code class="ph codeph">CHAR</code>, <code class="ph codeph">VARCHAR</code>, + <code class="ph codeph">ARRAY</code>, <code class="ph codeph">MAP</code>, and <code class="ph codeph">STRUCT</code> cannot be used with Kudu tables. + </p> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Many of the complex type examples refer to tables + such as <code class="ph codeph">CUSTOMER</code> and <code class="ph codeph">REGION</code> + adapted from the tables used in the TPC-H benchmark. + See <a class="xref" href="../shared/../topics/impala_complex_types.html#complex_sample_schema">Sample Schema and Data for Experimenting with Impala Complex Types</a> + for the table definitions. + </div> + + <p class="p"> + The following example shows a table with various kinds of <code class="ph codeph">MAP</code> columns, + both at the top level and nested within other complex types. + Each row represents information about a specific country, with complex type fields + of various levels of nesting to represent different information associated + with the country: factual measurements such as area and population, + notable people in different categories, geographic features such as + cities, points of interest within each city, and mountains with associated facts. + Practice the <code class="ph codeph">CREATE TABLE</code> and query notation for complex type columns + using empty tables, until you can visualize a complex data structure and construct corresponding SQL statements reliably. + </p> + +<pre class="pre codeblock"><code>create TABLE map_demo +( + country_id BIGINT, + +-- Numeric facts about each country, looked up by name. +-- For example, 'Area':1000, 'Population':999999. +-- Using a MAP instead of a STRUCT because there could be +-- a different set of facts for each country. + metrics MAP <STRING, BIGINT>, + +-- MAP whose value part is an ARRAY. +-- For example, the key 'Famous Politicians' could represent an array of 10 elements, +-- while the key 'Famous Actors' could represent an array of 20 elements. + notables MAP <STRING, ARRAY <STRING>>, + +-- MAP that is a field within a STRUCT. +-- (The STRUCT is inside another ARRAY, because it is rare +-- for a STRUCT to be a top-level column.) +-- For example, city #1 might have points of interest with key 'Zoo', +-- representing an array of 3 different zoos. +-- City #2 might have completely different kinds of points of interest. +-- Because the set of field names is potentially large, and most entries could be blank, +-- a MAP makes more sense than a STRUCT to represent such a sparse data structure. + cities ARRAY < STRUCT < + name: STRING, + points_of_interest: MAP <STRING, ARRAY <STRING>> + >>, + +-- MAP that is an element within an ARRAY. The MAP is inside a STRUCT field to associate +-- the mountain name with all the facts about the mountain. +-- The "key" of the map (the first STRING field) represents the name of some fact whose value +-- can be expressed as an integer, such as 'Height', 'Year First Climbed', and so on. + mountains ARRAY < STRUCT < name: STRING, facts: MAP <STRING, INT > > > +) +STORED AS PARQUET; + +</code></pre> + +<pre class="pre codeblock"><code>DESCRIBE map_demo; ++------------+------------------------------------------------+ +| name | type | ++------------+------------------------------------------------+ +| country_id | bigint | +| metrics | map<string,bigint> | +| notables | map<string,array<string>> | +| cities | array<struct< | +| | name:string, | +| | points_of_interest:map<string,array<string>> | +| | >> | +| mountains | array<struct< | +| | name:string, | +| | facts:map<string,int> | +| | >> | ++------------+------------------------------------------------+ + +DESCRIBE map_demo.metrics; ++-------+--------+ +| name | type | ++-------+--------+ +| key | string | +| value | bigint | ++-------+--------+ + +DESCRIBE map_demo.notables; ++-------+---------------+ +| name | type | ++-------+---------------+ +| key | string | +| value | array<string> | ++-------+---------------+ + +DESCRIBE map_demo.notables.value; ++------+--------+ +| name | type | ++------+--------+ +| item | string | +| pos | bigint | ++------+--------+ + +DESCRIBE map_demo.cities; ++------+------------------------------------------------+ +| name | type | ++------+------------------------------------------------+ +| item | struct< | +| | name:string, | +| | points_of_interest:map<string,array<string>> | +| | > | +| pos | bigint | ++------+------------------------------------------------+ + +DESCRIBE map_demo.cities.item.points_of_interest; ++-------+---------------+ +| name | type | ++-------+---------------+ +| key | string | +| value | array<string> | ++-------+---------------+ + +DESCRIBE map_demo.cities.item.points_of_interest.value; ++------+--------+ +| name | type | ++------+--------+ +| item | string | +| pos | bigint | ++------+--------+ + +DESCRIBE map_demo.mountains; ++------+-------------------------+ +| name | type | ++------+-------------------------+ +| item | struct< | +| | name:string, | +| | facts:map<string,int> | +| | > | +| pos | bigint | ++------+-------------------------+ + +DESCRIBE map_demo.mountains.item.facts; ++-------+--------+ +| name | type | ++-------+--------+ +| key | string | +| value | int | ++-------+--------+ + +</code></pre> + + <p class="p"> + The following example shows a table that uses a variety of data types for the <code class="ph codeph">MAP</code> + <span class="q">"key"</span> field. Typically, you use <code class="ph codeph">BIGINT</code> or <code class="ph codeph">STRING</code> to use + numeric or character-based keys without worrying about exceeding any size or length constraints. + </p> + +<pre class="pre codeblock"><code>CREATE TABLE map_demo_obscure +( + id BIGINT, + m1 MAP <INT, INT>, + m2 MAP <SMALLINT, INT>, + m3 MAP <TINYINT, INT>, + m4 MAP <TIMESTAMP, INT>, + m5 MAP <BOOLEAN, INT>, + m6 MAP <CHAR(5), INT>, + m7 MAP <VARCHAR(25), INT>, + m8 MAP <FLOAT, INT>, + m9 MAP <DOUBLE, INT>, + m10 MAP <DECIMAL(12,2), INT> +) +STORED AS PARQUET; + +</code></pre> + +<pre class="pre codeblock"><code>CREATE TABLE celebrities (name STRING, birth_year MAP < STRING, SMALLINT >) STORED AS PARQUET; +-- A typical row might represent values with 2 different birth years, such as: +-- ("Joe Movie Star", { "real": 1972, "claimed": 1977 }) + +CREATE TABLE countries (name STRING, famous_leaders MAP < INT, STRING >) STORED AS PARQUET; +-- A typical row might represent values with different leaders, with key values corresponding to their numeric sequence, such as: +-- ("United States", { 1: "George Washington", 3: "Thomas Jefferson", 16: "Abraham Lincoln" }) + +CREATE TABLE airlines (name STRING, special_meals MAP < STRING, MAP < STRING, STRING > >) STORED AS PARQUET; +-- A typical row might represent values with multiple kinds of meals, each with several components: +-- ("Elegant Airlines", +-- { +-- "vegetarian": { "breakfast": "pancakes", "snack": "cookies", "dinner": "rice pilaf" }, +-- "gluten free": { "breakfast": "oatmeal", "snack": "fruit", "dinner": "chicken" } +-- } ) +</code></pre> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a>, + <a class="xref" href="impala_array.html#array">ARRAY Complex Type (Impala 2.3 or higher only)</a>, + <a class="xref" href="impala_struct.html#struct">STRUCT Complex Type (Impala 2.3 or higher only)</a> + + </p> + + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_datatypes.html">Data Types</a></div></div></nav></article></main></body></html> \ No newline at end of file
