http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_string_functions.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_string_functions.xml b/docs/topics/impala_string_functions.xml new file mode 100644 index 0000000..a051ed5 --- /dev/null +++ b/docs/topics/impala_string_functions.xml @@ -0,0 +1,719 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="string_functions"> + + <title>Impala String Functions</title> + <titlealts><navtitle>String Functions</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.0.0"> + String functions are classified as those primarily accepting or returning <codeph>STRING</codeph>, + <codeph>VARCHAR</codeph>, or <codeph>CHAR</codeph> data types, for example to measure the length of a string + or concatenate two strings together. + <ul> + <li> + All the functions that accept <codeph>STRING</codeph> arguments also accept the <codeph>VARCHAR</codeph> + and <codeph>CHAR</codeph> types introduced in Impala 2.0. + </li> + + <li> + Whenever <codeph>VARCHAR</codeph> or <codeph>CHAR</codeph> values are passed to a function that returns a + string value, the return type is normalized to <codeph>STRING</codeph>. For example, a call to + <codeph>concat()</codeph> with a mix of <codeph>STRING</codeph>, <codeph>VARCHAR</codeph>, and + <codeph>CHAR</codeph> arguments produces a <codeph>STRING</codeph> result. + </li> + </ul> + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + The string functions operate mainly on these data types: <xref href="impala_string.xml#string"/>, + <xref href="impala_varchar.xml#varchar"/>, and <xref href="impala_char.xml#char"/>. + </p> + + <p> + <b>Function reference:</b> + </p> + + <p> + Impala supports the following string functions: + </p> + + <dl> + <dlentry id="ascii"> + + <dt> + <codeph>ascii(string str)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">ascii() function</indexterm> + <b>Purpose:</b> Returns the numeric ASCII code of the first character of the argument. + <p> + <b>Return type:</b> <codeph>int</codeph> + </p> + </dd> + + </dlentry> + + <dlentry rev="2.3.0" id="btrim"> + + <dt> + <codeph>btrim(string a)</codeph>, + <codeph>btrim(string a, string chars_to_trim)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">btrim() function</indexterm> + <b>Purpose:</b> Removes all instances of one or more characters + from the start and end of a <codeph>STRING</codeph> value. + By default, removes only spaces. + If a non-<codeph>NULL</codeph> optional second argument is specified, the function removes all + occurrences of characters in that second argument from the beginning and + end of the string. + <p><b>Return type:</b> <codeph>string</codeph></p> + <p conref="../shared/impala_common.xml#common/added_in_230"/> + <p conref="../shared/impala_common.xml#common/example_blurb"/> + <p> + The following examples show the default <codeph>btrim()</codeph> behavior, + and what changes when you specify the optional second argument. + All the examples bracket the output value with <codeph>[ ]</codeph> + so that you can see any leading or trailing spaces in the <codeph>btrim()</codeph> result. + By default, the function removes and number of both leading and trailing spaces. + When the second argument is specified, any number of occurrences of any + character in the second argument are removed from the start and end of the + input string; in this case, spaces are not removed (unless they are part of the second + argument) and any instances of the characters are not removed if they do not come + right at the beginning or end of the string. + </p> +<codeblock>-- Remove multiple spaces before and one space after. +select concat('[',btrim(' hello '),']'); ++---------------------------------------+ +| concat('[', btrim(' hello '), ']') | ++---------------------------------------+ +| [hello] | ++---------------------------------------+ + +-- Remove any instances of x or y or z at beginning or end. Leave spaces alone. +select concat('[',btrim('xy hello zyzzxx','xyz'),']'); ++------------------------------------------------------+ +| concat('[', btrim('xy hello zyzzxx', 'xyz'), ']') | ++------------------------------------------------------+ +| [ hello ] | ++------------------------------------------------------+ + +-- Remove any instances of x or y or z at beginning or end. +-- Leave x, y, z alone in the middle of the string. +select concat('[',btrim('xyhelxyzlozyzzxx','xyz'),']'); ++----------------------------------------------------+ +| concat('[', btrim('xyhelxyzlozyzzxx', 'xyz'), ']') | ++----------------------------------------------------+ +| [helxyzlo] | ++----------------------------------------------------+ +</codeblock> + </dd> + + </dlentry> + + <dlentry rev="1.3.0" id="char_length"> + + <dt> + <codeph>char_length(string a), <ph rev="1.3.0" id="character_length">character_length(string a)</ph></codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">char_length() function</indexterm> + <indexterm audience="Cloudera">character_length() function</indexterm> + <b>Purpose:</b> Returns the length in characters of the argument string. Aliases for the + <codeph>length()</codeph> function. + <p> + <b>Return type:</b> <codeph>int</codeph> + </p> + </dd> + + </dlentry> + + <dlentry rev="2.3.0" id="chr"> + + <dt> + <codeph>chr(int character_code)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">chr() function</indexterm> + <b>Purpose:</b> Returns a character specified by a decimal code point value. + The interpretation and display of the resulting character depends on your system locale. + Because consistent processing of Impala string values is only guaranteed + for values within the ASCII range, only use this function for values + corresponding to ASCII characters. + In particular, parameter values greater than 255 return an empty string. + <p><b>Return type:</b> <codeph>string</codeph></p> + <p> + <b>Usage notes:</b> Can be used as the inverse of the <codeph>ascii()</codeph> function, which + converts a character to its numeric ASCII code. + </p> + <p conref="../shared/impala_common.xml#common/added_in_230"/> + <p conref="../shared/impala_common.xml#common/example_blurb"/> +<codeblock>SELECT chr(65); ++---------+ +| chr(65) | ++---------+ +| A | ++---------+ + +SELECT chr(97); ++---------+ +| chr(97) | ++---------+ +| a | ++---------+ +</codeblock> + </dd> + + </dlentry> + + <dlentry id="concat"> + + <dt> + <codeph>concat(string a, string b...)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">concat() function</indexterm> + <b>Purpose:</b> Returns a single string representing all the argument values joined together. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + <p conref="../shared/impala_common.xml#common/concat_blurb"/> + </dd> + + </dlentry> + + <dlentry id="concat_ws"> + + <dt> + <codeph>concat_ws(string sep, string a, string b...)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">concat_ws() function</indexterm> + <b>Purpose:</b> Returns a single string representing the second and following argument values joined + together, delimited by a specified separator. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + <p conref="../shared/impala_common.xml#common/concat_blurb"/> + </dd> + + </dlentry> + + <dlentry id="find_in_set"> + + <dt> + <codeph>find_in_set(string str, string strList)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">find_in_set() function</indexterm> + <b>Purpose:</b> Returns the position (starting from 1) of the first occurrence of a specified string + within a comma-separated string. Returns <codeph>NULL</codeph> if either argument is + <codeph>NULL</codeph>, 0 if the search string is not found, or 0 if the search string contains a comma. + <p> + <b>Return type:</b> <codeph>int</codeph> + </p> + </dd> + + </dlentry> + + <dlentry rev="1.2" id="group_concat"> + + <dt> + <codeph>group_concat(string s [, string sep])</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">group_concat() function</indexterm> + <b>Purpose:</b> Returns a single string representing the argument value concatenated together for each + row of the result set. If the optional separator string is specified, the separator is added between each + pair of concatenated values. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + <p conref="../shared/impala_common.xml#common/concat_blurb"/> + <p> + By default, returns a single string covering the whole result set. To include other columns or values + in the result set, or to produce multiple concatenated strings for subsets of rows, include a + <codeph>GROUP BY</codeph> clause in the query. + </p> + </dd> + + </dlentry> + + <dlentry rev="1.2" id="initcap"> + + <dt> + <codeph>initcap(string str)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">initcap() function</indexterm> + <b>Purpose:</b> Returns the input string with the first letter capitalized. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="instr"> + + <dt> + <codeph>instr(string str, string substr)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">instr() function</indexterm> + <b>Purpose:</b> Returns the position (starting from 1) of the first occurrence of a substring within a + longer string. + <p> + <b>Return type:</b> <codeph>int</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="length"> + + <dt> + <codeph>length(string a)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">length() function</indexterm> + <b>Purpose:</b> Returns the length in characters of the argument string. + <p> + <b>Return type:</b> <codeph>int</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="locate"> + + <dt> + <codeph>locate(string substr, string str[, int pos])</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">locate() function</indexterm> + <b>Purpose:</b> Returns the position (starting from 1) of the first occurrence of a substring within a + longer string, optionally after a particular position. + <p> + <b>Return type:</b> <codeph>int</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="lower"> + + <dt> + <codeph>lower(string a), <ph id="lcase">lcase(string a)</ph> </codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">lower() function</indexterm> + <b>Purpose:</b> Returns the argument string converted to all-lowercase. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="lpad"> + + <dt> + <codeph>lpad(string str, int len, string pad)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">lpad() function</indexterm> + <b>Purpose:</b> Returns a string of a specified length, based on the first argument string. If the + specified string is too short, it is padded on the left with a repeating sequence of the characters from + the pad string. If the specified string is too long, it is truncated on the right. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="ltrim"> + + <dt> + <codeph>ltrim(string a)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">ltrim() function</indexterm> + <b>Purpose:</b> Returns the argument string with any leading spaces removed from the left side. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="parse_url"> + + <dt> + <codeph>parse_url(string urlString, string partToExtract [, string keyToExtract])</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">parse_url() function</indexterm> + <b>Purpose:</b> Returns the portion of a URL corresponding to a specified part. The part argument can be + <codeph>'PROTOCOL'</codeph>, <codeph>'HOST'</codeph>, <codeph>'PATH'</codeph>, <codeph>'REF'</codeph>, + <codeph>'AUTHORITY'</codeph>, <codeph>'FILE'</codeph>, <codeph>'USERINFO'</codeph>, or + <codeph>'QUERY'</codeph>. Uppercase is required for these literal values. When requesting the + <codeph>QUERY</codeph> portion of the URL, you can optionally specify a key to retrieve just the + associated value from the key-value pairs in the query string. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + <p> + <b>Usage notes:</b> This function is important for the traditional Hadoop use case of interpreting web + logs. For example, if the web traffic data features raw URLs not divided into separate table columns, + you can count visitors to a particular page by extracting the <codeph>'PATH'</codeph> or + <codeph>'FILE'</codeph> field, or analyze search terms by extracting the corresponding key from the + <codeph>'QUERY'</codeph> field. + </p> + </dd> + + </dlentry> + + <dlentry id="regexp_extract"> + + <dt> + <codeph>regexp_extract(string subject, string pattern, int index)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">regexp_extract() function</indexterm> + <b>Purpose:</b> Returns the specified () group from a string based on a regular expression pattern. Group + 0 refers to the entire extracted string, while group 1, 2, and so on refers to the first, second, and so + on <codeph>(...)</codeph> portion. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + <p conref="../shared/impala_common.xml#common/regexp_re2"/> + <p conref="../shared/impala_common.xml#common/regexp_re2_warning"/> + <p conref="../shared/impala_common.xml#common/regexp_escapes"/> + <p conref="../shared/impala_common.xml#common/example_blurb"/> + <p> + This example shows how group 0 matches the full pattern string, including the portion outside any + <codeph>()</codeph> group: + </p> +<codeblock>[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',0); ++------------------------------------------------------+ +| regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 0) | ++------------------------------------------------------+ +| abcdef123ghi456 | ++------------------------------------------------------+ +Returned 1 row(s) in 0.11s</codeblock> + <p> + This example shows how group 1 matches just the contents inside the first <codeph>()</codeph> group in + the pattern string: + </p> +<codeblock>[localhost:21000] > select regexp_extract('abcdef123ghi456jkl','.*?(\\d+)',1); ++------------------------------------------------------+ +| regexp_extract('abcdef123ghi456jkl', '.*?(\\d+)', 1) | ++------------------------------------------------------+ +| 456 | ++------------------------------------------------------+ +Returned 1 row(s) in 0.11s</codeblock> + <p rev="2.0.0"> + Unlike in earlier Impala releases, the regular expression library used in Impala 2.0 and later supports + the <codeph>.*?</codeph> idiom for non-greedy matches. This example shows how a pattern string starting + with <codeph>.*?</codeph> matches the shortest possible portion of the source string, returning the + rightmost set of lowercase letters. A pattern string both starting and ending with <codeph>.*?</codeph> + finds two potential matches of equal length, and returns the first one found (the leftmost set of + lowercase letters). + </p> +<codeblock>[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+)',1); ++--------------------------------------------------------+ +| regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+)', 1) | ++--------------------------------------------------------+ +| def | ++--------------------------------------------------------+ +[localhost:21000] > select regexp_extract('AbcdBCdefGHI','.*?([[:lower:]]+).*?',1); ++-----------------------------------------------------------+ +| regexp_extract('abcdbcdefghi', '.*?([[:lower:]]+).*?', 1) | ++-----------------------------------------------------------+ +| bcd | ++-----------------------------------------------------------+ +</codeblock> + </dd> + + </dlentry> + + <dlentry id="regexp_replace"> + + <dt> + <codeph>regexp_replace(string initial, string pattern, string replacement)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">regexp_replace() function</indexterm> + <b>Purpose:</b> Returns the initial argument with the regular expression pattern replaced by the final + argument string. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + <p conref="../shared/impala_common.xml#common/regexp_re2"/> + <p conref="../shared/impala_common.xml#common/regexp_re2_warning"/> + <p conref="../shared/impala_common.xml#common/regexp_escapes"/> + <p conref="../shared/impala_common.xml#common/example_blurb"/> + <p> + These examples show how you can replace parts of a string matching a pattern with replacement text, + which can include backreferences to any <codeph>()</codeph> groups in the pattern string. The + backreference numbers start at 1, and any <codeph>\</codeph> characters must be escaped as + <codeph>\\</codeph>. + </p> + <p> + Replace a character pattern with new text: + </p> +<codeblock>[localhost:21000] > select regexp_replace('aaabbbaaa','b+','xyz'); ++------------------------------------------+ +| regexp_replace('aaabbbaaa', 'b+', 'xyz') | ++------------------------------------------+ +| aaaxyzaaa | ++------------------------------------------+ +Returned 1 row(s) in 0.11s</codeblock> + <p> + Replace a character pattern with substitution text that includes the original matching text: + </p> +<codeblock>[localhost:21000] > select regexp_replace('aaabbbaaa','(b+)','<\\1>'); ++----------------------------------------------+ +| regexp_replace('aaabbbaaa', '(b+)', '<\\1>') | ++----------------------------------------------+ +| aaa<bbb>aaa | ++----------------------------------------------+ +Returned 1 row(s) in 0.11s</codeblock> + <p> + Remove all characters that are not digits: + </p> +<codeblock>[localhost:21000] > select regexp_replace('123-456-789','[^[:digit:]]',''); ++---------------------------------------------------+ +| regexp_replace('123-456-789', '[^[:digit:]]', '') | ++---------------------------------------------------+ +| 123456789 | ++---------------------------------------------------+ +Returned 1 row(s) in 0.12s</codeblock> + </dd> + + </dlentry> + + <dlentry id="repeat"> + + <dt> + <codeph>repeat(string str, int n)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">repeat() function</indexterm> + <b>Purpose:</b> Returns the argument string repeated a specified number of times. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="reverse"> + + <dt> + <codeph>reverse(string a)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">reverse() function</indexterm> + <b>Purpose:</b> Returns the argument string with characters in reversed order. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="rpad"> + + <dt> + <codeph>rpad(string str, int len, string pad)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">rpad() function</indexterm> + <b>Purpose:</b> Returns a string of a specified length, based on the first argument string. If the + specified string is too short, it is padded on the right with a repeating sequence of the characters from + the pad string. If the specified string is too long, it is truncated on the right. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="rtrim"> + + <dt> + <codeph>rtrim(string a)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">rtrim() function</indexterm> + <b>Purpose:</b> Returns the argument string with any trailing spaces removed from the right side. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="space"> + + <dt> + <codeph>space(int n)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">space() function</indexterm> + <b>Purpose:</b> Returns a concatenated string of the specified number of spaces. Shorthand for + <codeph>repeat(' ',<varname>n</varname>)</codeph>. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="strleft"> + + <dt> + <codeph>strleft(string a, int num_chars)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">strleft() function</indexterm> + <b>Purpose:</b> Returns the leftmost characters of the string. Shorthand for a call to + <codeph>substr()</codeph> with 2 arguments. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="strright"> + + <dt> + <codeph>strright(string a, int num_chars)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">strright() function</indexterm> + <b>Purpose:</b> Returns the rightmost characters of the string. Shorthand for a call to + <codeph>substr()</codeph> with 2 arguments. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="substr"> + + <dt> + <codeph>substr(string a, int start [, int len]), <ph id="substring">substring(string a, int start [, int + len])</ph></codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">substr() function</indexterm> + <b>Purpose:</b> Returns the portion of the string starting at a specified point, optionally with a + specified maximum length. The characters in the string are indexed starting at 1. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="translate"> + + <dt> + <codeph>translate(string input, string from, string to)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">translate() function</indexterm> + <b>Purpose:</b> Returns the input string with a set of characters replaced by another set of characters. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="trim"> + + <dt> + <codeph>trim(string a)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">trim() function</indexterm> + <b>Purpose:</b> Returns the input string with both leading and trailing spaces removed. The same as + passing the string through both <codeph>ltrim()</codeph> and <codeph>rtrim()</codeph>. + <p> + <b>Usage notes:</b> Often used during data cleansing operations during the ETL cycle, if input values might still have surrounding spaces. + For a more general-purpose function that can remove other leading and trailing characters besides spaces, see <codeph>btrim()</codeph>. + </p> + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="upper"> + + <dt> + <codeph>upper(string a), <ph id="ucase">ucase(string a)</ph></codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">upper() function</indexterm> + <indexterm audience="Cloudera">ucase() function</indexterm> + <b>Purpose:</b> Returns the argument string converted to all-uppercase. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + </dl> + </conbody> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_struct.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_struct.xml b/docs/topics/impala_struct.xml new file mode 100644 index 0000000..1e440fc --- /dev/null +++ b/docs/topics/impala_struct.xml @@ -0,0 +1,406 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> + + <concept id="struct"> + + <title>STRUCT Complex Type (CDH 5.5 or higher only)</title> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + </metadata> + </prolog> + + <conbody> + + <p> + A complex data type, representing multiple fields of a single item. + Frequently used as the element type of an <codeph>ARRAY</codeph> + or the <codeph>VALUE</codeph> part of a <codeph>MAP</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock><varname>column_name</varname> STRUCT < <varname>name</varname> : <varname>type</varname> [COMMENT '<varname>comment_string</varname>'], ... > + +type ::= <varname>primitive_type</varname> | <varname>complex_type</varname> +</codeblock> + + <p> + The names and number of fields within the <codeph>STRUCT</codeph> are fixed. Each field can be a different type. + A field within a <codeph>STRUCT</codeph> can also be another <codeph>STRUCT</codeph>, or an <codeph>ARRAY</codeph> + or a <codeph>MAP</codeph>, allowing you to create nested data structures with a maximum nesting depth of 100. + </p> + + <p> + A <codeph>STRUCT</codeph> can be the top-level type for a column, or can itself be an item within an <codeph>ARRAY</codeph> + or the value part of the key-value pair in a <codeph>MAP</codeph>. + </p> + + <p> + When a <codeph>STRUCT</codeph> is used as an <codeph>ARRAY</codeph> element or a <codeph>MAP</codeph> value, + you use a join clause to bring the <codeph>ARRAY</codeph> or <codeph>MAP</codeph> elements into the result set, and then refer + to <codeph><varname>array_name</varname>.ITEM.<varname>field</varname></codeph> or + <codeph><varname>map_name</varname>.VALUE.<varname>field</varname></codeph>. + In the case of a <codeph>STRUCT</codeph> directly inside an <codeph>ARRAY</codeph> or <codeph>MAP</codeph>, + you can omit the <codeph>.ITEM</codeph> and <codeph>.VALUE</codeph> pseudocolumns and refer directly to + <codeph><varname>array_name</varname>.<varname>field</varname></codeph> or + <codeph><varname>map_name</varname>.<varname>field</varname></codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_combo"/> + + <p> + A <codeph>STRUCT</codeph> is similar conceptually to a table row: it contains a fixed number of named fields, + each with a predefined type. To combine two related tables, while using complex types to + minimize repetition, the typical way to represent that data is as an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> elements. + </p> + + <p> + Because a <codeph>STRUCT</codeph> has a fixed number of named fields, it typically does not make sense + to have a <codeph>STRUCT</codeph> as the type of a table column. In such a case, you could just make each field of the + <codeph>STRUCT</codeph> into a separate column of the table. The <codeph>STRUCT</codeph> type is most + useful as an item of an <codeph>ARRAY</codeph> or the value part of the key-value pair in a <codeph>MAP</codeph>. + A nested type column with a <codeph>STRUCT</codeph> at the lowest level lets you associate a variable + number of row-like objects with each row of the table. + </p> + + <p> + The <codeph>STRUCT</codeph> type is straightforward to reference within a query. You do not need to + include the <codeph>STRUCT</codeph> column in a join clause or give it a table alias, as is + required for the <codeph>ARRAY</codeph> and <codeph>MAP</codeph> types. You refer to the individual + fields using dot notation, such as <codeph><varname>struct_column_name</varname>.<varname>field_name</varname></codeph>, + without any pseudocolumn such as <codeph>ITEM</codeph> or <codeph>VALUE</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_describe"/> + + <p conref="../shared/impala_common.xml#common/internals_blurb"/> + + <p> + Within the Parquet data file, the values for each <codeph>STRUCT</codeph> field are stored adjacent to each other, + so that they can be encoded and compressed using all the Parquet techniques for storing sets of similar or + repeated values. The adjacency applies even when the <codeph>STRUCT</codeph> values are part of an + <codeph>ARRAY</codeph> or <codeph>MAP</codeph>. During a query, Impala avoids unnecessary I/O by reading only the portions + of the Parquet data file containing the requested <codeph>STRUCT</codeph> fields. + </p> + + <p conref="../shared/impala_common.xml#common/added_in_230"/> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <ul conref="../shared/impala_common.xml#common/complex_types_restrictions"> + <li/> + </ul> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/> + + <p> + The following example shows a table with various kinds of <codeph>STRUCT</codeph> columns, + both at the top level and nested within other complex types. + Practice the <codeph>CREATE TABLE</codeph> and query notation for complex type columns + using empty tables, until you can visualize a complex data structure and construct corresponding SQL statements reliably. + </p> + +<codeblock><![CDATA[CREATE TABLE struct_demo +( + id BIGINT, + name STRING, + +-- A STRUCT as a top-level column. Demonstrates how the table ID column +-- and the ID field within the STRUCT can coexist without a name conflict. + employee_info STRUCT < employer: STRING, id: BIGINT, address: STRING >, + +-- A STRUCT as the element type of an ARRAY. + places_lived ARRAY < STRUCT <street: STRING, city: STRING, country: STRING >>, + +-- A STRUCT as the value portion of the key-value pairs in a MAP. + memorable_moments MAP < STRING, STRUCT < year: INT, place: STRING, details: STRING >>, + +-- A STRUCT where one of the fields is another STRUCT. + current_address STRUCT < street_address: STRUCT <street_number: INT, street_name: STRING, street_type: STRING>, country: STRING, postal_code: STRING > +) +STORED AS PARQUET; +]]> +</codeblock> + + <p> + The following example shows how to examine the structure of a table containing one or more + <codeph>STRUCT</codeph> columns by using the <codeph>DESCRIBE</codeph> statement. You can + visualize each <codeph>STRUCT</codeph> as its own table, with columns + named the same as each field of the <codeph>STRUCT</codeph>. + If the <codeph>STRUCT</codeph> is nested inside another complex type, such as <codeph>ARRAY</codeph>, + you can extend the qualified name passed to <codeph>DESCRIBE</codeph> until the output + shows just the <codeph>STRUCT</codeph> fields. + </p> + +<codeblock><![CDATA[DESCRIBE struct_demo; ++-------------------+--------------------------+ +| name | type | ++-------------------+--------------------------+ +| id | bigint | +| name | string | +| employee_info | struct< | +| | employer:string, | +| | id:bigint, | +| | address:string | +| | > | +| places_lived | array<struct< | +| | street:string, | +| | city:string, | +| | country:string | +| | >> | +| memorable_moments | map<string,struct< | +| | year:int, | +| | place:string, | +| | details:string | +| | >> | +| current_address | struct< | +| | street_address:struct< | +| | street_number:int, | +| | street_name:string, | +| | street_type:string | +| | >, | +| | country:string, | +| | postal_code:string | +| | > | ++-------------------+--------------------------+ + +DESCRIBE struct_demo.employee_info; ++----------+--------+ +| name | type | ++----------+--------+ +| employer | string | +| id | bigint | +| address | string | ++----------+--------+ + +-- Because PLACES_LIVED is a STRUCT inside an ARRAY, the +-- initial DESCRIBE shows the structure of the ARRAY. +DESCRIBE struct_demo.places_lived; ++------+------------------+ +| name | type | ++------+------------------+ +| item | struct< | +| | street:string, | +| | city:string, | +| | country:string | +| | > | +| pos | bigint | ++------+------------------+ + +-- Ask for the details of the ITEM field of the ARRAY to see +-- just the layout of the STRUCT. +DESCRIBE struct_demo.places_lived.item; ++---------+--------+ +| name | type | ++---------+--------+ +| street | string | +| city | string | +| country | string | ++---------+--------+ + +-- Likewise, MEMORABLE_MOMENTS has a STRUCT inside a MAP, +-- which requires an extra level of qualified name to see +-- just the STRUCT part. +DESCRIBE struct_demo.memorable_moments; ++-------+------------------+ +| name | type | ++-------+------------------+ +| key | string | +| value | struct< | +| | year:int, | +| | place:string, | +| | details:string | +| | > | ++-------+------------------+ + +-- For a MAP, ask to see the VALUE field to see the +-- corresponding STRUCT fields in a table-like structure. +DESCRIBE struct_demo.memorable_moments.value; ++---------+--------+ +| name | type | ++---------+--------+ +| year | int | +| place | string | +| details | string | ++---------+--------+ + +-- For a STRUCT inside a STRUCT, we can see the fields of the +-- outer STRUCT... +DESCRIBE struct_demo.current_address; ++----------------+-----------------------+ +| name | type | ++----------------+-----------------------+ +| street_address | struct< | +| | street_number:int, | +| | street_name:string, | +| | street_type:string | +| | > | +| country | string | +| postal_code | string | ++----------------+-----------------------+ + +-- ...and then use a further qualified name to see just the +-- fields of the inner STRUCT. +DESCRIBE struct_demo.current_address.street_address; ++---------------+--------+ +| name | type | ++---------------+--------+ +| street_number | int | +| street_name | string | +| street_type | string | ++---------------+--------+ +]]> +</codeblock> + + <p> + The following example shows how to examine the structure of a table containing one or more + <codeph>STRUCT</codeph> columns by using the <codeph>DESCRIBE</codeph> statement. You can + visualize each <codeph>STRUCT</codeph> as its own table, with columns + named the same as each field of the <codeph>STRUCT</codeph>. + If the <codeph>STRUCT</codeph> is nested inside another complex type, such as <codeph>ARRAY</codeph>, + you can extend the qualified name passed to <codeph>DESCRIBE</codeph> until the output + shows just the <codeph>STRUCT</codeph> fields. + </p> + +<!-- To do: See why the most verbose query form gives an error. --> +<codeblock><![CDATA[DESCRIBE struct_demo; ++-------------------+--------------------------+---------+ +| name | type | comment | ++-------------------+--------------------------+---------+ +| id | bigint | | +| name | string | | +| employee_info | struct< | | +| | employer:string, | | +| | id:bigint, | | +| | address:string | | +| | > | | +| places_lived | array<struct< | | +| | street:string, | | +| | city:string, | | +| | country:string | | +| | >> | | +| memorable_moments | map<string,struct< | | +| | year:int, | | +| | place:string, | | +| | details:string | | +| | >> | | +| current_address | struct< | | +| | street_address:struct< | | +| | street_number:int, | | +| | street_name:string, | | +| | street_type:string | | +| | >, | | +| | country:string, | | +| | postal_code:string | | +| | > | | ++-------------------+--------------------------+---------+ + +SELECT id, employee_info.id FROM struct_demo; + +SELECT id, employee_info.id AS employee_id FROM struct_demo; + +SELECT id, employee_info.id AS employee_id, employee_info.employer + FROM struct_demo; + +SELECT id, name, street, city, country + FROM struct_demo, struct_demo.places_lived; + +SELECT id, name, struct_demo.places_lived.pos, struct_demo.places_lived.street, struct_demo.places_lived.city, struct_demo.places_lived.country + FROM struct_demo, struct_demo.places_lived; +ERROR: AnalysisException: Illegal column/field reference 'struct_demo.places_lived.pos' with intermediate collection 'places_lived' of type 'ARRAY<STRUCT<street:STRING,city:STRING,country:STRING>>' + +SELECT id, name, pl.pos, pl.street, pl.city, pl.country + FROM struct_demo, struct_demo.places_lived AS pl; + +SELECT id, name, places_lived.pos, places_lived.street, places_lived.city, places_lived.country + FROM struct_demo, struct_demo.places_lived; + +SELECT id, name, pos, street, city, country + FROM struct_demo, struct_demo.places_lived; + +SELECT id, name, struct_demo.memorable_moments.key, + struct_demo.memorable_moments.value.year, + struct_demo.memorable_moments.value.place, + struct_demo.memorable_moments.value.details +FROM struct_demo, struct_demo.memorable_moments +WHERE struct_demo.memorable_moments.key IN ('Birthday','Anniversary','Graduation'); +ERROR: AnalysisException: Illegal column/field reference 'struct_demo.memorable_moments.key' with intermediate collection 'memorable_moments' of type 'MAP<STRING,STRUCT<year:INT,place:STRING,details:STRING>>' + +SELECT id, name, mm.key, mm.value.year, mm.value.place, mm.value.details + FROM struct_demo, struct_demo.memorable_moments AS mm +WHERE mm.key IN ('Birthday','Anniversary','Graduation'); + +SELECT id, name, memorable_moments.key, memorable_moments.value.year, + memorable_moments.value.place, memorable_moments.value.details +FROM struct_demo, struct_demo.memorable_moments +WHERE key IN ('Birthday','Anniversary','Graduation'); + +SELECT id, name, key, value.year, value.place, value.details + FROM struct_demo, struct_demo.memorable_moments +WHERE key IN ('Birthday','Anniversary','Graduation'); + +SELECT id, name, key, year, place, details + FROM struct_demo, struct_demo.memorable_moments +WHERE key IN ('Birthday','Anniversary','Graduation'); + +SELECT id, name, + current_address.street_address.street_number, + current_address.street_address.street_name, + current_address.street_address.street_type, + current_address.country, + current_address.postal_code +FROM struct_demo; +]]> +</codeblock> + + <p> + For example, this table uses a struct that encodes several data values for each phone number associated + with a person. Each person can have a variable-length array of associated phone numbers, and queries can + refer to the category field to locate specific home, work, mobile, and so on kinds of phone numbers. + </p> + +<codeblock>CREATE TABLE contact_info_many_structs +( + id BIGINT, name STRING, + phone_numbers ARRAY < STRUCT <category:STRING, country_code:STRING, area_code:SMALLINT, full_number:STRING, mobile:BOOLEAN, carrier:STRING > > +) STORED AS PARQUET; +</codeblock> + + <p> + Because structs are naturally suited to composite values where the fields have different data types, you might use them + to decompose things such as addresses: + </p> + +<codeblock>CREATE TABLE contact_info_detailed_address +( + id BIGINT, name STRING, + address STRUCT < house_number:INT, street:STRING, street_type:STRING, apartment:STRING, city:STRING, region:STRING, country:STRING > +); +</codeblock> + + <p> + In a big data context, splitting out data fields such as the number part of the address and the street name + could let you do analysis on each field independently. For example, which streets have the largest number + range of addresses, what are the statistical properties of the street names, which areas have a higher + proportion of <q>Roads</q>, <q>Courts</q> or <q>Boulevards</q>, and so on. + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_complex_types.xml#complex_types"/>, + <xref href="impala_array.xml#array"/>, + <!-- <xref href="impala_struct.xml#struct"/>, --> + <xref href="impala_map.xml#map"/> + </p> + + </conbody> + + </concept> + http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_subqueries.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_subqueries.xml b/docs/topics/impala_subqueries.xml new file mode 100644 index 0000000..ed99f3a --- /dev/null +++ b/docs/topics/impala_subqueries.xml @@ -0,0 +1,318 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="subqueries"> + + <title>Subqueries in Impala SELECT Statements</title> + <titlealts><navtitle>Subqueries</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">subqueries</indexterm> + A <term>subquery</term> is a query that is nested within another query. Subqueries let queries on one table + dynamically adapt based on the contents of another table. This technique provides great flexibility and + expressive power for SQL queries. + </p> + + <p> + A subquery can return a result set for use in the <codeph>FROM</codeph> or <codeph>WITH</codeph> clauses, or + with operators such as <codeph>IN</codeph> or <codeph>EXISTS</codeph>. + </p> + + <p> + A <term>scalar subquery</term> produces a result set with a single row containing a single column, typically + produced by an aggregation function such as <codeph>MAX()</codeph> or <codeph>SUM()</codeph>. This single + result value can be substituted in scalar contexts such as arguments to comparison operators. If the result + set is empty, the value of the scalar subquery is <codeph>NULL</codeph>. For example, the following query + finds the maximum value of <codeph>T2.Y</codeph> and then substitutes that value into the + <codeph>WHERE</codeph> clause of the outer block that queries <codeph>T1</codeph>: + </p> + +<codeblock>SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2); +</codeblock> + + <p> + <term>Uncorrelated subqueries</term> do not refer to any tables from the outer block of the query. The same + value or set of values produced by the subquery is used when evaluating each row from the outer query block. + In this example, the subquery returns an arbitrary number of values from <codeph>T2.Y</codeph>, and each + value of <codeph>T1.X</codeph> is tested for membership in that same set of values: + </p> + +<codeblock>SELECT x FROM t1 WHERE x IN (SELECT y FROM t2); +</codeblock> + + <p> + <term>Correlated subqueries</term> compare one or more values from the outer query block to values referenced + in the <codeph>WHERE</codeph> clause of the subquery. Each row evaluated by the outer <codeph>WHERE</codeph> + clause can be evaluated using a different set of values. These kinds of subqueries are restricted in the + kinds of comparisons they can do between columns of the inner and outer tables. (See the following + <b>Restrictions</b> item.) + </p> + + <p> + For example, the following query finds all the employees with salaries that are higher than average for their + department. The subquery potentially computes a different <codeph>AVG()</codeph> value for each employee. + </p> + +<!-- TK: Construct an EMPLOYEES schema to try out examples like these. --> + +<codeblock>SELECT employee_name, employee_id FROM employees one WHERE + salary > (SELECT avg(salary) FROM employees two WHERE one.dept_id = two.dept_id); +</codeblock> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + <b>Subquery in the <codeph>FROM</codeph> clause:</b> + </p> + +<codeblock>SELECT <varname>select_list</varname> FROM <varname>table_ref</varname> [, <varname>table_ref</varname> ...] + +<varname>table_ref</varname> ::= <varname>table_name</varname> | (<varname>select_statement</varname>) +</codeblock> + + <p> + <b>Subqueries in <codeph>WHERE</codeph> clause:</b> + </p> + +<codeblock>WHERE <varname>value</varname> <varname>comparison_operator</varname> (<varname>scalar_select_statement</varname>) +WHERE <varname>value</varname> [NOT] IN (<varname>select_statement</varname>) +WHERE [NOT] EXISTS (<varname>correlated_select_statement</varname>) +WHERE NOT EXISTS (<varname>correlated_select_statement</varname>) +</codeblock> + + <p> + <codeph>comparison_operator</codeph> is a numeric comparison such as <codeph>=</codeph>, + <codeph><=</codeph>, <codeph>!=</codeph>, and so on, or a string comparison operator such as + <codeph>LIKE</codeph> or <codeph>REGEXP</codeph>. + </p> + + <p rev="2.0.0"> + Although you can use non-equality comparison operators such as <codeph><</codeph> or + <codeph>>=</codeph>, the subquery must include at least one equality comparison between the columns of the + inner and outer query blocks. + </p> + + <p> + All syntax is available for both correlated and uncorrelated queries, except that the <codeph>NOT + EXISTS</codeph> clause cannot be used with an uncorrelated subquery. + </p> + + <p> + Impala subqueries can be nested arbitrarily deep. + </p> + + <p conref="../shared/impala_common.xml#common/sql1999"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + This example illustrates how subqueries can be used in the <codeph>FROM</codeph> clause to organize the table + names, column names, and column values by producing intermediate result sets, especially for join queries. + </p> + +<codeblock>SELECT avg(t1.x), max(t2.y) FROM + (SELECT id, cast(a AS DECIMAL(10,5)) AS x FROM raw_data WHERE a BETWEEN 0 AND 100) AS t1 + JOIN + (SELECT id, length(s) AS y FROM raw_data WHERE s LIKE 'A%') AS t2; + USING (id); +</codeblock> + + <p rev="2.0.0"> + These examples show how a query can test for the existence of values in a separate table using the + <codeph>EXISTS()</codeph> operator with a subquery. +<!-- +Internally, these queries are processed in a way similar to join queries. +Because the values from the second table are not part of the result set, the subquery +is more efficient than the equivalent join query. +--> + </p> + + <p> + The following examples show how a value can be compared against a set of values returned by a subquery. + </p> + +<codeblock rev="2.0.0">SELECT count(x) FROM t1 WHERE EXISTS(SELECT 1 FROM t2 WHERE t1.x = t2.y * 10); + +SELECT x FROM t1 WHERE x IN (SELECT y FROM t2 WHERE state = 'CA'); +</codeblock> + + <p> + The following examples demonstrate scalar subqueries. When a subquery is known to return a single value, you + can substitute it where you would normally put a constant value. + </p> + +<codeblock>SELECT x FROM t1 WHERE y = (SELECT max(z) FROM t2); +SELECT x FROM t1 WHERE y > (SELECT count(z) FROM t2); +</codeblock> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/partitioning_blurb"/> --> + +<!-- +<p conref="/Content/impala_common_xi44078.xml#common/hbase_blurb"/> +<p> +Currently, the <codeph>IN (<varname>subquery</varname>)</codeph> operator results in a full table scan +of an HBase table, rather than being translated into a series of single-row lookups. +Therefore, this is not an efficient construct to use with Impala queries for HBase tables. +</p> +--> + +<!-- +<p conref="/Content/impala_common_xi44078.xml#common/parquet_blurb"/> +<p conref="/Content/impala_common_xi44078.xml#common/text_blurb"/> +<p conref="/Content/impala_common_xi44078.xml#common/compatibility_blurb"/> +--> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + If the same table is referenced in both the outer and inner query blocks, construct a table alias in the + outer query block and use a fully qualified name to distinguish the inner and outer table references: + </p> + +<!-- TK: verify the logic of this example. Probably have other similar ones that could be reused here. --> + +<codeblock>SELECT * FROM t1 one WHERE id IN (SELECT parent FROM t1 two WHERE t1.parent = t2.id); +</codeblock> + + <p conref="../shared/impala_common.xml#common/internals_blurb"/> + + <p> + Internally, subqueries involving <codeph>IN</codeph>, <codeph>NOT IN</codeph>, <codeph>EXISTS</codeph>, or + <codeph>NOT EXISTS</codeph> clauses are rewritten into join queries. Depending on the syntax, the subquery + might be rewritten to an outer join, semi join, cross join, or anti join. + </p> + + <p> + A query is processed differently depending on whether the subquery calls any aggregation functions. There are + correlated and uncorrelated forms, with and without calls to aggregation functions. Each of these four + categories is rewritten differently. + </p> + + <p conref="../shared/impala_common.xml#common/column_stats_blurb"/> + + <p> + Because queries that include correlated and uncorrelated subqueries in the <codeph>WHERE</codeph> clause are + written into join queries, to achieve best performance, follow the same guidelines for running the + <codeph>COMPUTE STATS</codeph> statement as you do for tables involved in regular join queries. Run the + <codeph>COMPUTE STATS</codeph> statement for each associated tables after loading or substantially changing + the data in that table. See <xref href="impala_perf_stats.xml#perf_stats"/> for details. + </p> + + <p> + <b>Added in:</b> Subqueries are substantially enhanced starting in Impala 2.0 for CDH 4, and CDH 5.2.0. Now, + they can be used in the <codeph>WHERE</codeph> clause, in combination with clauses such as + <codeph>EXISTS</codeph> and <codeph>IN</codeph>, rather than just in the <codeph>FROM</codeph> clause. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p> + The initial Impala support for nested subqueries addresses the most common use cases. Some restrictions + remain: + </p> + + <ul> + <li> + <p> + Although you can use subqueries in a query involving <codeph>UNION</codeph> or <codeph>UNION ALL</codeph> + in Impala 2.1.0 and higher, currently you cannot construct a union of two subqueries (for example, in the + argument of an <codeph>IN</codeph> or <codeph>EXISTS</codeph> operator). + </p> + </li> + + <li> + <p> + Subqueries returning scalar values cannot be used with the operators <codeph>ANY</codeph> or + <codeph>ALL</codeph>. (Impala does not currently have a <codeph>SOME</codeph> operator, but if it did, + the same restriction would apply.) + </p> + </li> + + <li> + <p> + For the <codeph>EXISTS</codeph> and <codeph>NOT EXISTS</codeph> clauses, any subquery comparing values + from the outer query block to another table must use at least one equality comparison, not exclusively + other kinds of comparisons such as less than, greater than, <codeph>BETWEEN</codeph>, or + <codeph>!=</codeph>. + </p> + </li> + + <li> +<!-- TK: think this is no longer true. --> + <p> + Currently, a scalar subquery cannot be used as the first or second argument to the + <codeph>BETWEEN</codeph> operator. + </p> + </li> + + <li> + <p> + A subquery cannot be used inside an <codeph>OR</codeph> conjunction. Expressions inside a subquery, for + example in the <codeph>WHERE</codeph> clause, can use <codeph>OR</codeph> conjunctions; the restriction + only applies to parts of the query <q>above</q> the subquery. + </p> + </li> + + <li> + <p> + Scalar subqueries are only supported in numeric contexts. You cannot use a scalar subquery as an argument + to the <codeph>LIKE</codeph>, <codeph>REGEXP</codeph>, or <codeph>RLIKE</codeph> operators, or compare it + to a value of a non-numeric type such as <codeph>TIMESTAMP</codeph> or <codeph>BOOLEAN</codeph>. + </p> + </li> + + <li> + <p> +<!-- A subquery cannot be used to generate a scalar value for a function call. --> + You cannot use subqueries with the <codeph>CASE</codeph> function to generate the comparison value, the + values to be compared against, or the return value. + </p> + </li> + + <li> + <p> + A subquery is not allowed in the filter condition for the <codeph>HAVING</codeph> clause. (Strictly + speaking, a subquery cannot appear anywhere outside the <codeph>WITH</codeph>, <codeph>FROM</codeph>, and + <codeph>WHERE</codeph> clauses.) + </p> + </li> + + <li> + <p> + You must use a fully qualified name + (<codeph><varname>table_name</varname>.<varname>column_name</varname></codeph> or + <codeph><varname>database_name</varname>.<varname>table_name</varname>.<varname>column_name</varname></codeph>) + when referring to any column from the outer query block within a subquery. + </p> + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p rev="2.3.0"> + For the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) + available in CDH 5.5 / Impala 2.3 and higher, the join queries that <q>unpack</q> complex type + columns often use correlated subqueries in the <codeph>FROM</codeph> clause. + For example, if the first table in the join clause is <codeph>CUSTOMER</codeph>, the second + join clause might have a subquery that selects from the column <codeph>CUSTOMER.C_ORDERS</codeph>, + which is an <codeph>ARRAY</codeph>. The subquery re-evaluates the <codeph>ARRAY</codeph> elements + corresponding to each row from the <codeph>CUSTOMER</codeph> table. + See <xref href="impala_complex_types.xml#complex_types"/> for details and examples of + using subqueries with complex types. + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_operators.xml#exists"/>, <xref href="impala_operators.xml#in"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_sum.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_sum.xml b/docs/topics/impala_sum.xml new file mode 100644 index 0000000..6d25f1c --- /dev/null +++ b/docs/topics/impala_sum.xml @@ -0,0 +1,236 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="sum"> + + <title>SUM Function</title> + <titlealts><navtitle>SUM</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Analytic Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">sum() function</indexterm> + An aggregate function that returns the sum of a set of numbers. Its single argument can be numeric column, or + the numeric result of a function or expression applied to the column value. Rows with a <codeph>NULL</codeph> + value for the specified column are ignored. If the table is empty, or all the values supplied to + <codeph>MIN</codeph> are <codeph>NULL</codeph>, <codeph>SUM</codeph> returns <codeph>NULL</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>SUM([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]</codeblock> + + <p> + When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of + grouping values. + </p> + + <p> + <b>Return type:</b> <codeph>BIGINT</codeph> for integer arguments, <codeph>DOUBLE</codeph> for floating-point + arguments + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows how to use <codeph>SUM()</codeph> to compute the total for all the values in the + table, a subset of values, or the sum for each combination of values in the <codeph>GROUP BY</codeph> clause: + </p> + +<codeblock>-- Total all the values for this column in the table. +select sum(c1) from t1; +-- Find the total for this column from a subset of the table. +select sum(c1) from t1 where month = 'January' and year = '2013'; +-- Find the total from a set of numeric function results. +select sum(length(s)) from t1; +-- Often used with functions that return predefined values to compute a score. +select sum(case when grade = 'A' then 1.0 when grade = 'B' then 0.75 else 0) as class_honors from test_scores; +-- Can also be used in combination with DISTINCT and/or GROUP BY. +-- Return more than one result. +select month, year, sum(purchase_price) from store_stats group by month, year; +-- Filter the input to eliminate duplicates before performing the calculation. +select sum(distinct x) from t1; +</codeblock> + + <p rev="2.0.0"> + The following examples show how to use <codeph>SUM()</codeph> in an analytic context. They use a table + containing integers from 1 to 10. Notice how the <codeph>SUM()</codeph> is reported for each input value, as + opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set. +<codeblock>select x, property, sum(x) <b>over (partition by property)</b> as sum from int_t where property in ('odd','even'); ++----+----------+-----+ +| x | property | sum | ++----+----------+-----+ +| 2 | even | 30 | +| 4 | even | 30 | +| 6 | even | 30 | +| 8 | even | 30 | +| 10 | even | 30 | +| 1 | odd | 25 | +| 3 | odd | 25 | +| 5 | odd | 25 | +| 7 | odd | 25 | +| 9 | odd | 25 | ++----+----------+-----+ +</codeblock> + +Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving +set of rows (the <q>window</q>). The following examples use <codeph>SUM()</codeph> in an analytic context +(that is, with an <codeph>OVER()</codeph> clause) to produce a running total of all the even values, +then a running total of all the odd values. The basic <codeph>ORDER BY x</codeph> clause implicitly +activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, +which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, +therefore all of these examples produce the same results: +<codeblock>select x, property, + sum(x) over (partition by property <b>order by x</b>) as 'cumulative total' + from int_t where property in ('odd','even'); ++----+----------+------------------+ +| x | property | cumulative total | ++----+----------+------------------+ +| 2 | even | 2 | +| 4 | even | 6 | +| 6 | even | 12 | +| 8 | even | 20 | +| 10 | even | 30 | +| 1 | odd | 1 | +| 3 | odd | 4 | +| 5 | odd | 9 | +| 7 | odd | 16 | +| 9 | odd | 25 | ++----+----------+------------------+ + +select x, property, + sum(x) over + ( + partition by property + <b>order by x</b> + <b>range between unbounded preceding and current row</b> + ) as 'cumulative total' +from int_t where property in ('odd','even'); ++----+----------+------------------+ +| x | property | cumulative total | ++----+----------+------------------+ +| 2 | even | 2 | +| 4 | even | 6 | +| 6 | even | 12 | +| 8 | even | 20 | +| 10 | even | 30 | +| 1 | odd | 1 | +| 3 | odd | 4 | +| 5 | odd | 9 | +| 7 | odd | 16 | +| 9 | odd | 25 | ++----+----------+------------------+ + +select x, property, + sum(x) over + ( + partition by property + <b>order by x</b> + <b>rows between unbounded preceding and current row</b> + ) as 'cumulative total' + from int_t where property in ('odd','even'); ++----+----------+------------------+ +| x | property | cumulative total | ++----+----------+------------------+ +| 2 | even | 2 | +| 4 | even | 6 | +| 6 | even | 12 | +| 8 | even | 20 | +| 10 | even | 30 | +| 1 | odd | 1 | +| 3 | odd | 4 | +| 5 | odd | 9 | +| 7 | odd | 16 | +| 9 | odd | 25 | ++----+----------+------------------+ +</codeblock> + +Changing the direction of the <codeph>ORDER BY</codeph> clause causes the intermediate +results of the cumulative total to be calculated in a different order: + +<codeblock>select sum(x) over (partition by property <b>order by x desc</b>) as 'cumulative total' + from int_t where property in ('odd','even'); ++----+----------+------------------+ +| x | property | cumulative total | ++----+----------+------------------+ +| 10 | even | 10 | +| 8 | even | 18 | +| 6 | even | 24 | +| 4 | even | 28 | +| 2 | even | 30 | +| 9 | odd | 9 | +| 7 | odd | 16 | +| 5 | odd | 21 | +| 3 | odd | 24 | +| 1 | odd | 25 | ++----+----------+------------------+ +</codeblock> + +The following examples show how to construct a moving window, with a running total taking into account 1 row before +and 1 row after the current row, within the same partition (all the even values or all the odd values). +Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of +moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph> +clause: +<codeblock>select x, property, + sum(x) over + ( + partition by property + <b>order by x</b> + <b>rows between 1 preceding and 1 following</b> + ) as 'moving total' + from int_t where property in ('odd','even'); ++----+----------+--------------+ +| x | property | moving total | ++----+----------+--------------+ +| 2 | even | 6 | +| 4 | even | 12 | +| 6 | even | 18 | +| 8 | even | 24 | +| 10 | even | 18 | +| 1 | odd | 4 | +| 3 | odd | 9 | +| 5 | odd | 15 | +| 7 | odd | 21 | +| 9 | odd | 16 | ++----+----------+--------------+ + +-- Doesn't work because of syntax restriction on RANGE clause. +select x, property, + sum(x) over + ( + partition by property + <b>order by x</b> + <b>range between 1 preceding and 1 following</b> + ) as 'moving total' +from int_t where property in ('odd','even'); +ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW. +</codeblock> + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#analytic_functions"/> + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + +<!-- This conref appears under SUM(), AVG(), FLOAT, and DOUBLE topics. --> + + <p conref="../shared/impala_common.xml#common/sum_double"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_support_start_over.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_support_start_over.xml b/docs/topics/impala_support_start_over.xml new file mode 100644 index 0000000..2c17b5d --- /dev/null +++ b/docs/topics/impala_support_start_over.xml @@ -0,0 +1,29 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="support_start_over"> + + <title>SUPPORT_START_OVER Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">SUPPORT_START_OVER query option</indexterm> + Leave this setting at its default value. + It is a read-only setting, tested by some client applications such as Hue. + </p> + <p> + If you accidentally change it through <cmdname>impala-shell</cmdname>, + subsequent queries encounter errors until you undo the change + by issuing <codeph>UNSET support_start_over</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_sync_ddl.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_sync_ddl.xml b/docs/topics/impala_sync_ddl.xml new file mode 100644 index 0000000..b217f67 --- /dev/null +++ b/docs/topics/impala_sync_ddl.xml @@ -0,0 +1,56 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2.1" id="sync_ddl"> + + <title>SYNC_DDL Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="DDL"/> + <data name="Category" value="SQL"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">SYNC_DDL query option</indexterm> + When enabled, causes any DDL operation such as <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> + to return only when the changes have been propagated to all other Impala nodes in the cluster by the Impala + catalog service. That way, if you issue a subsequent <codeph>CONNECT</codeph> statement in + <cmdname>impala-shell</cmdname> to connect to a different node in the cluster, you can be sure that other + node will already recognize any added or changed tables. (The catalog service automatically broadcasts the + DDL changes to all nodes automatically, but without this option there could be a period of inconsistency if + you quickly switched to another node, such as by issuing a subsequent query through a load-balancing proxy.) + </p> + + <p> + Although <codeph>INSERT</codeph> is classified as a DML statement, when the <codeph>SYNC_DDL</codeph> option + is enabled, <codeph>INSERT</codeph> statements also delay their completion until all the underlying data and + metadata changes are propagated to all Impala nodes. Internally, Impala inserts have similarities with DDL + statements in traditional database systems, because they create metadata needed to track HDFS block locations + for new files and they potentially add new partitions to partitioned tables. + </p> + + <note> + Because this option can introduce a delay after each write operation, if you are running a sequence of + <codeph>CREATE DATABASE</codeph>, <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, + <codeph>INSERT</codeph>, and similar statements within a setup script, to minimize the overall delay you can + enable the <codeph>SYNC_DDL</codeph> query option only near the end, before the final DDL statement. + </note> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <draft-comment translate="no"> +Example could be useful here. +</draft-comment> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_ddl.xml#ddl"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_tables.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_tables.xml b/docs/topics/impala_tables.xml new file mode 100644 index 0000000..30e3737 --- /dev/null +++ b/docs/topics/impala_tables.xml @@ -0,0 +1,258 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="tables"> + + <title>Overview of Impala Tables</title> + <titlealts><navtitle>Tables</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Databases"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p/> + + <p> + Tables are the primary containers for data in Impala. They have the familiar row and column layout similar to + other database systems, plus some features such as partitioning often associated with higher-end data + warehouse systems. + </p> + + <p> + Logically, each table has a structure based on the definition of its columns, partitions, and other + properties. + </p> + + <p> + Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files + underneath that directory: + </p> + + <ul> + <li> + <xref href="impala_tables.xml#internal_tables">Internal tables</xref> are managed by Impala, and use directories + inside the designated Impala work area. + </li> + + <li> + <xref href="impala_tables.xml#external_tables">External tables</xref> use arbitrary HDFS directories, where + the data files are typically shared between different Hadoop components. + </li> + + <li> + Large-scale data is usually handled by partitioned tables, where the data files are divided among different + HDFS subdirectories. + </li> + </ul> + + <p rev="2.2.0"> + Impala tables can also represent data that is stored in HBase, or in the Amazon S3 filesystem (CDH 5.4.0 or higher), + or on Isilon storage devices (CDH 5.4.3 or higher). See <xref href="impala_hbase.xml#impala_hbase"/>, + <xref href="impala_s3.xml#s3"/>, and <xref href="impala_isilon.xml#impala_isilon"/> + for details about those special kinds of tables. + </p> + + <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/> + + <p> + <b>Related statements:</b> <xref href="impala_create_table.xml#create_table"/>, + <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/> + <xref href="impala_insert.xml#insert"/>, <xref href="impala_load_data.xml#load_data"/>, + <xref href="impala_select.xml#select"/> + </p> + </conbody> + + <concept id="internal_tables"> + + <title>Internal Tables</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">internal tables</indexterm> + The default kind of table produced by the <codeph>CREATE TABLE</codeph> statement is known as an internal + table. (Its counterpart is the external table, produced by the <codeph>CREATE EXTERNAL TABLE</codeph> + syntax.) + </p> + + <ul> + <li> + <p> + Impala creates a directory in HDFS to hold the data files. + </p> + </li> + + <li> + <p> + You can create data in internal tables by issuing <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> + statements. + </p> + </li> + + <li> + <p> + If you add or replace data using HDFS operations, issue the <codeph>REFRESH</codeph> command in + <cmdname>impala-shell</cmdname> so that Impala recognizes the changes in data files, block locations, + and so on. + </p> + </li> + + <li> + <p> + When you issue a <codeph>DROP TABLE</codeph> statement, Impala physically removes all the data files + from the directory. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/check_internal_external_table"/> + </li> + + <li> + <p> + When you issue an <codeph>ALTER TABLE</codeph> statement to rename an internal table, all data files + are moved into the new HDFS directory for the table. The files are moved even if they were formerly in + a directory outside the Impala data directory, for example in an internal table with a + <codeph>LOCATION</codeph> attribute pointing to an outside HDFS directory. + </p> + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p conref="../shared/impala_common.xml#common/switch_internal_external_table"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_tables.xml#external_tables"/>, <xref href="impala_create_table.xml#create_table"/>, + <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/>, + <xref href="impala_describe.xml#describe"/> + </p> + </conbody> + </concept> + + <concept id="external_tables"> + + <title>External Tables</title> + + <conbody> + + <p> + <indexterm audience="Cloudera">external tables</indexterm> + The syntax <codeph>CREATE EXTERNAL TABLE</codeph> sets up an Impala table that points at existing data + files, potentially in HDFS locations outside the normal Impala data directories.. This operation saves the + expense of importing the data into a new table when you already have the data files in a known location in + HDFS, in the desired file format. + </p> + + <ul> + <li> + <p> + You can use Impala to query the data in this table. + </p> + </li> + + <li> + <p> + You can create data in external tables by issuing <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> + statements. + </p> + </li> + + <li> + <p> + If you add or replace data using HDFS operations, issue the <codeph>REFRESH</codeph> command in + <cmdname>impala-shell</cmdname> so that Impala recognizes the changes in data files, block locations, + and so on. + </p> + </li> + + <li> + <p> + When you issue a <codeph>DROP TABLE</codeph> statement in Impala, that removes the connection that + Impala has with the associated data files, but does not physically remove the underlying data. You can + continue to use the data files with other Hadoop components and HDFS operations. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/check_internal_external_table"/> + </li> + + <li> + <p> + When you issue an <codeph>ALTER TABLE</codeph> statement to rename an external table, all data files + are left in their original locations. + </p> + </li> + + <li> + <p> + You can point multiple external tables at the same HDFS directory by using the same + <codeph>LOCATION</codeph> attribute for each one. The tables could have different column definitions, + as long as the number and types of columns are compatible with the schema evolution considerations for + the underlying file type. For example, for text data files, one table might define a certain column as + a <codeph>STRING</codeph> while another defines the same column as a <codeph>BIGINT</codeph>. + </p> + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p conref="../shared/impala_common.xml#common/switch_internal_external_table"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_tables.xml#internal_tables"/>, <xref href="impala_create_table.xml#create_table"/>, + <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/>, + <xref href="impala_describe.xml#describe"/> + </p> + </conbody> + </concept> + + <concept id="table_file_formats"> + <title>File Formats</title> + <conbody> + <p> + Each table has an associated file format, which determines how Impala interprets the + associated data files. See <xref href="impala_file_formats.xml#file_formats"/> for details. + </p> + <p> + You set the file format during the <codeph>CREATE TABLE</codeph> statement, + or change it later using the <codeph>ALTER TABLE</codeph> statement. + Partitioned tables can have a different file format for individual partitions, + allowing you to change the file format used in your ETL process for new data + without going back and reconverting all the existing data in the same table. + </p> + <p> + Any <codeph>INSERT</codeph> statements produce new data files with the current file format of the table. + For existing data files, changing the file format of the table does not automatically do any data conversion. + You must use <codeph>TRUNCATE TABLE</codeph> or <codeph>INSERT OVERWRITE</codeph> to remove any previous data + files that use the old file format. + Then you use the <codeph>LOAD DATA</codeph> statement, <codeph>INSERT ... SELECT</codeph>, or other mechanism + to put data files of the correct format into the table. + </p> + <p> + The default file format, text, is the most flexible and easy to produce when you are just getting started with + Impala. The Parquet file format offers the highest query performance and uses compression to reduce storage + requirements; therefore, Cloudera recommends using Parquet for Impala tables with substantial amounts of data. + <ph rev="2.3.0">Also, the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) + available in CDH 5.5 / Impala 2.3 and higher are currently only supported with the Parquet file type.</ph> + Based on your existing ETL workflow, you might use other file formats such as Avro, possibly doing a final + conversion step to Parquet to take advantage of its performance for analytic queries. + </p> + </conbody> + </concept> + +</concept>
