http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_decimal.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_decimal.xml b/docs/topics/impala_decimal.xml new file mode 100644 index 0000000..c0c98d9 --- /dev/null +++ b/docs/topics/impala_decimal.xml @@ -0,0 +1,836 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.4.0" id="decimal"> + + <title>DECIMAL Data Type (CDH 5.1 or higher only)</title> + <titlealts><navtitle>DECIMAL (CDH 5.1 or higher only)</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + A numeric data type with fixed scale and precision, used in <codeph>CREATE TABLE</codeph> and <codeph>ALTER + TABLE</codeph> statements. Suitable for financial and other arithmetic calculations where the imprecise + representation and rounding behavior of <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> make those types + impractical. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + + <p> + In the column definition of a <codeph>CREATE TABLE</codeph> statement: + </p> + +<codeblock><varname>column_name</varname> DECIMAL[(<varname>precision</varname>[,<varname>scale</varname>])]</codeblock> + + <p> + <codeph>DECIMAL</codeph> with no precision or scale values is equivalent to <codeph>DECIMAL(9,0)</codeph>. + </p> + + <p> + <b>Precision and Scale:</b> + </p> + + <p> + <varname>precision</varname> represents the total number of digits that can be represented by the column, + regardless of the location of the decimal point. This value must be between 1 and 38. For example, + representing integer values up to 9999, and floating-point values up to 99.99, both require a precision of 4. + You can also represent corresponding negative values, without any change in the precision. For example, the + range -9999 to 9999 still only requires a precision of 4. + </p> + + <p> + <varname>scale</varname> represents the number of fractional digits. This value must be less than or equal to + <varname>precision</varname>. A scale of 0 produces integral values, with no fractional part. If precision + and scale are equal, all the digits come after the decimal point, making all the values between 0 and + 0.999... or 0 and -0.999... + </p> + + <p> + When <varname>precision</varname> and <varname>scale</varname> are omitted, a <codeph>DECIMAL</codeph> value + is treated as <codeph>DECIMAL(9,0)</codeph>, that is, an integer value ranging from + <codeph>-999,999,999</codeph> to <codeph>999,999,999</codeph>. This is the largest <codeph>DECIMAL</codeph> + value that can still be represented in 4 bytes. If precision is specified but scale is omitted, Impala uses a + value of zero for the scale. + </p> + + <p> + Both <varname>precision</varname> and <varname>scale</varname> must be specified as integer literals, not any + other kind of constant expressions. + </p> + + <p> + To check the precision or scale for arbitrary values, you can call the + <xref href="impala_math_functions.xml#math_functions"><codeph>precision()</codeph> and + <codeph>scale()</codeph> built-in functions</xref>. 
For example, you might use these values to figure out how + many characters are required for various fields in a report, or to understand the rounding characteristics of + a formula as applied to a particular <codeph>DECIMAL</codeph> column. + </p> + + <p> + <b>Range:</b> + </p> + + <p> + The maximum precision value is 38. Thus, the largest integral value is represented by + <codeph>DECIMAL(38,0)</codeph> (999... with 9 repeated 38 times). The most precise fractional value (between + 0 and 1, or 0 and -1) is represented by <codeph>DECIMAL(38,38)</codeph>, with 38 digits to the right of the + decimal point. The value closest to 0 would be .0000...1 (37 zeros and the final 1). The value closest to 1 + would be .999... (9 repeated 38 times). + </p> + + <p> + For a given precision and scale, the range of <codeph>DECIMAL</codeph> values is the same in the positive and + negative directions. For example, <codeph>DECIMAL(4,2)</codeph> can represent from -99.99 to 99.99. This is + different from other integral numeric types where the positive and negative bounds differ slightly. + </p> + + <p> + When you use <codeph>DECIMAL</codeph> values in arithmetic expressions, the precision and scale of the result + value are determined as follows: + </p> + + <ul> + <li> + <p> + For addition and subtraction, the precision and scale are based on the maximum possible result, that is, + if all the digits of the input values were 9s and the absolute values were added together. + </p> +<!-- Seems like buggy output from this first query, so hiding the example for the time being. --> +<codeblock audience="Cloudera">[localhost:21000] > select 50000.5 + 12.444, precision(50000.5 + 12.444), scale(50000.5 + 12.444); ++------------------+-----------------------------+-------------------------+ +| 50000.5 + 12.444 | precision(50000.5 + 12.444) | scale(50000.5 + 12.444) | ++------------------+-----------------------------+-------------------------+ +| 50012.944 | 9 | 3 | ++------------------+-----------------------------+-------------------------+ +[localhost:21000] > select 99999.9 + 99.999, precision(99999.9 + 99.999), scale(99999.9 + 99.999); ++------------------+-----------------------------+-------------------------+ +| 99999.9 + 99.999 | precision(99999.9 + 99.999) | scale(99999.9 + 99.999) | ++------------------+-----------------------------+-------------------------+ +| 100099.899 | 9 | 3 | ++------------------+-----------------------------+-------------------------+ +</codeblock> + </li> + + <li> + <p> + For multiplication, the precision is the sum of the precisions of the input values. The scale is the sum + of the scales of the input values. + </p> + </li> + +<!-- Need to add some specifics to discussion of division. Details here: http://blogs.msdn.com/b/sqlprogrammability/archive/2006/03/29/564110.aspx --> + + <li> + <p> + For division, Impala sets the precision and scale to values large enough to represent the whole and + fractional parts of the result. + </p> + </li> + + <li> + <p> + For <codeph>UNION</codeph>, the scale is the larger of the scales of the input values, and the precision + is increased if necessary to accommodate any additional fractional digits. If the same input value has + the largest precision and the largest scale, the result value has the same precision and scale. If one + value has a larger precision but smaller scale, the scale of the result value is increased. For example, + <codeph>DECIMAL(20,2) UNION DECIMAL(8,6)</codeph> produces a result of type + <codeph>DECIMAL(24,6)</codeph>. 
The extra 4 fractional digits of scale (6-2) are accommodated by
+          extending the precision by the same amount (20+4).
+        </p>
+      </li>
+
+      <li>
+        <p>
+          To double-check, you can always call the <codeph>PRECISION()</codeph> and <codeph>SCALE()</codeph>
+          functions on the results of an arithmetic expression to see the relevant values, or use a <codeph>CREATE
+          TABLE AS SELECT</codeph> statement to define a column based on the return type of the expression.
+        </p>
+      </li>
+    </ul>
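+
+    <p>
+      As the last rule suggests, you can verify how Impala assigns a precision and scale to any
+      expression by calling the <codeph>PRECISION()</codeph> and <codeph>SCALE()</codeph> functions on it.
+      The following is a minimal sketch of such a check; the comments assume that Impala interprets
+      the two literals as <codeph>DECIMAL(6,1)</codeph> and <codeph>DECIMAL(5,3)</codeph>:
+    </p>
+
+<codeblock>-- For multiplication, the result precision and scale are the sums of the
+-- input precisions and scales: expect precision 6+5 = 11 and scale 1+3 = 4.
+select precision(50000.5 * 12.444), scale(50000.5 * 12.444);
+</codeblock>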
+
+    <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
+
+    <ul>
+      <li>
+        The <codeph>DECIMAL</codeph> type is supported only in CDH 5.1.0 and higher.
+<!--
+        Although Impala-created tables containing <codeph>DECIMAL</codeph> columns are
+        readable in CDH 5.1, <codeph>DECIMAL</codeph> data is not interoperable with
+        other Hadoop components in CDH 4, and some Impala operations such as
+        <codeph>COMPUTE STATS</codeph> are not possible on such tables in CDH 4.
+        If you create a Parquet table with a <codeph>DECIMAL</codeph>
+        column under CDH 4, Impala issues a warning because the data files might not be readable from other CDH 4 components.
+-->
+      </li>
+
+<!--
+      <li>
+        The <codeph>DECIMAL</codeph> data type is a relatively new addition to the
+        Parquet file format. To read Impala-created Parquet files containing
+        <codeph>DECIMAL</codeph> columns from another Hadoop component such as
+        MapReduce, Pig, or Hive, use CDH 5.1 or higher, or the equivalent levels of the relevant components and Parquet
+        JARs from CDH 5.1.
+        If you create a Parquet table with a <codeph>DECIMAL</codeph>
+        column under CDH 4, Impala issues a warning because the data files might not be readable from other CDH 4 components.
+      </li>
+
+      <li>
+        In particular, Impala-created tables with <codeph>DECIMAL</codeph> columns are
+        not readable by Hive under CDH 4.
+      </li>
+-->
+
+      <li>
+        Use the <codeph>DECIMAL</codeph> data type in Impala for applications where you used the
+        <codeph>NUMBER</codeph> data type in Oracle. The Impala <codeph>DECIMAL</codeph> type does not support the
+        Oracle idioms of <codeph>*</codeph> for scale or negative values for precision.
+      </li>
+    </ul>
+
+    <p>
+      <b>Conversions and casting:</b>
+    </p>
+
+    <p>
+      <ph conref="../shared/impala_common.xml#common/cast_int_to_timestamp"/>
+    </p>
+
+    <p>
+      Impala automatically converts between <codeph>DECIMAL</codeph> and other numeric types where possible. A
+      <codeph>DECIMAL</codeph> with zero scale is converted to or from the smallest appropriate integral type. A
+      <codeph>DECIMAL</codeph> with a fractional part is automatically converted to or from the smallest
+      appropriate floating-point type. If the destination type does not have sufficient precision or scale to hold
+      all possible values of the source type, Impala raises an error and does not convert the value.
+    </p>
+
+    <p>
+      For example, these statements show how expressions of <codeph>DECIMAL</codeph> and other types are reconciled
+      to the same type in the context of <codeph>UNION</codeph> queries and <codeph>INSERT</codeph> statements:
+    </p>
+
+<codeblock>[localhost:21000] > select cast(1 as int) as x union select cast(1.5 as decimal(9,4)) as x;
++----------------+
+| x              |
++----------------+
+| 1.5000         |
+| 1.0000         |
++----------------+
+[localhost:21000] > create table int_vs_decimal as select cast(1 as int) as x union select cast(1.5 as decimal(9,4)) as x;
++-------------------+
+| summary           |
++-------------------+
+| Inserted 2 row(s) |
++-------------------+
+[localhost:21000] > desc int_vs_decimal;
++------+---------------+---------+
+| name | type          | comment |
++------+---------------+---------+
+| x    | decimal(14,4) |         |
++------+---------------+---------+
+</codeblock>
+
+    <p>
+      To avoid potential conversion errors, you can use <codeph>CAST()</codeph> to convert <codeph>DECIMAL</codeph>
+      values to <codeph>FLOAT</codeph>, <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, <codeph>INT</codeph>,
+      <codeph>BIGINT</codeph>, <codeph>STRING</codeph>, <codeph>TIMESTAMP</codeph>, or <codeph>BOOLEAN</codeph>.
+      You can use exponential notation in <codeph>DECIMAL</codeph> literals or when casting from
+      <codeph>STRING</codeph>, for example <codeph>1.0e6</codeph> to represent one million.
+    </p>
+
+    <p>
+      If you cast a value with more fractional digits than the scale of the destination type, any extra fractional
+      digits are truncated (not rounded). Casting a value to a target type with insufficient precision produces a
+      result of <codeph>NULL</codeph> and displays a runtime warning.
+    </p>
+
+<codeblock>[localhost:21000] > select cast(1.239 as decimal(3,2));
++-----------------------------+
+| cast(1.239 as decimal(3,2)) |
++-----------------------------+
+| 1.23                        |
++-----------------------------+
+[localhost:21000] > select cast(1234 as decimal(3));
++----------------------------+
+| cast(1234 as decimal(3,0)) |
++----------------------------+
+| NULL                       |
++----------------------------+
+WARNINGS: Expression overflowed, returning NULL
+</codeblock>
+
+    <p>
+      When you specify integer literals, for example in <codeph>INSERT ... VALUES</codeph> statements or arithmetic
+      expressions, those numbers are interpreted as the smallest applicable integer type. You must use
+      <codeph>CAST()</codeph> calls for some combinations of integer literals and <codeph>DECIMAL</codeph>
+      precision. For example, <codeph>INT</codeph> has a maximum value that is 10 digits long,
+      <codeph>TINYINT</codeph> has a maximum value that is 3 digits long, and so on. If you specify a value such as
+      123456 to go into a <codeph>DECIMAL</codeph> column, Impala checks if the column has enough precision to
+      represent the largest value of that integer type, and raises an error if not. Therefore, use an expression
+      like <codeph>CAST(123456 AS DECIMAL(9,0))</codeph> for <codeph>DECIMAL</codeph> columns with precision 9 or
+      less, <codeph>CAST(50 AS DECIMAL(2,0))</codeph> for <codeph>DECIMAL</codeph> columns with precision 2 or
+      less, and so on. For <codeph>DECIMAL</codeph> columns with precision 10 or greater, Impala automatically
+      interprets the value as the correct <codeph>DECIMAL</codeph> type; however, because
+      <codeph>DECIMAL(10)</codeph> requires 8 bytes of storage while <codeph>DECIMAL(9)</codeph> requires only 4
+      bytes, only use precision of 10 or higher when actually needed.
+    </p>
+
+<codeblock>[localhost:21000] > create table decimals_9_0 (x decimal);
+[localhost:21000] > insert into decimals_9_0 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000);
+ERROR: AnalysisException: Possible loss of precision for target table 'decimal_testing.decimals_9_0'.
+Expression '1' (type: INT) would need to be cast to DECIMAL(9,0) for column 'x'
+[localhost:21000] > insert into decimals_9_0 values (cast(1 as decimal)), (cast(2 as decimal)), (cast(4 as decimal)), (cast(8 as decimal)), (cast(16 as decimal)), (cast(1024 as decimal)), (cast(32768 as decimal)), (cast(65536 as decimal)), (cast(1000000 as decimal));
+
+[localhost:21000] > create table decimals_10_0 (x decimal(10,0));
+[localhost:21000] > insert into decimals_10_0 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000);
+[localhost:21000] >
+</codeblock>
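+
+    <p>
+      Recall from the earlier casting example that extra fractional digits are truncated rather
+      than rounded. If rounding is what you want, one approach is to apply the
+      <codeph>ROUND()</codeph> function (one of the built-in functions that accepts
+      <codeph>DECIMAL</codeph> arguments) before the cast. A minimal sketch:
+    </p>
+
+<codeblock>-- CAST alone truncates 1.239 to 1.23; rounding first yields 1.24 instead.
+select cast(round(1.239, 2) as decimal(3,2));
+</codeblock>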
+
+    <p>
+      Be aware that in memory and for binary file formats such as Parquet or Avro, <codeph>DECIMAL(10)</codeph> or
+      higher consumes 8 bytes while <codeph>DECIMAL(9)</codeph> (the default for <codeph>DECIMAL</codeph>) or lower
+      consumes 4 bytes. Therefore, to conserve space in large tables, use the smallest-precision
+      <codeph>DECIMAL</codeph> type that is appropriate and <codeph>CAST()</codeph> literal values where necessary,
+      rather than declaring <codeph>DECIMAL</codeph> columns with high precision for convenience.
+    </p>
+
+    <p>
+      To represent a very large or precise <codeph>DECIMAL</codeph> value as a literal, for example one that
+      contains more digits than can be represented by a <codeph>BIGINT</codeph> literal, use a quoted string or a
+      floating-point value for the number, and <codeph>CAST()</codeph> to the desired <codeph>DECIMAL</codeph>
+      type:
+    </p>
+
+<codeblock>insert into decimals_38_5 values (1), (2), (4), (8), (16), (1024), (32768), (65536), (1000000),
+  (cast("999999999999999999999999999999" as decimal(38,5))),
+  (cast(999999999999999999999999999999. as decimal(38,5)));
+</codeblock>
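+
+    <p>
+      A similar string-based technique lets you confirm the upper end of the range described
+      earlier. This hypothetical check builds the largest <codeph>DECIMAL(38,0)</codeph> value,
+      a string of 38 nines, with the <codeph>REPEAT()</codeph> string function:
+    </p>
+
+<codeblock>-- 38 nines is the largest value representable as DECIMAL(38,0).
+select cast(repeat('9', 38) as decimal(38,0));
+</codeblock>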
+
+    <ul>
+      <li>
+        <p>
+          The result of an aggregate function such as <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, or
+          <codeph>AVG()</codeph> on <codeph>DECIMAL</codeph> values is promoted to a precision of 38, with the same
+          scale as the underlying column. Thus, the result can represent the largest possible value at that
+          particular scale.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          <codeph>STRING</codeph> columns, literals, or expressions can be converted to <codeph>DECIMAL</codeph> as
+          long as the overall number of digits and digits to the right of the decimal point fit within the
+          specified precision and scale for the declared <codeph>DECIMAL</codeph> type. By default, a
+          <codeph>DECIMAL</codeph> value with no specified scale or precision can hold a maximum of 9 digits of an
+          integer value. If there are more digits in the string value than are allowed by the
+          <codeph>DECIMAL</codeph> scale and precision, the result is <codeph>NULL</codeph>.
+        </p>
+        <p>
+          The following examples demonstrate how <codeph>STRING</codeph> values with integer and fractional parts
+          are represented when converted to <codeph>DECIMAL</codeph>. If the scale is 0, the number is treated
+          as an integer value with a maximum of <varname>precision</varname> digits. If the scale is greater than
+          0, the precision must be increased to account for the digits both to the left and to the right of the decimal point.
+          As the scale increases, output values are printed with additional trailing zeros after the decimal
+          point if needed. Any trailing zeros after the decimal point in the <codeph>STRING</codeph> value must fit
+          within the number of digits specified by the scale.
+        </p>
+<codeblock>[localhost:21000] > select cast('100' as decimal); -- Small integer value fits within 9 digits of precision.
++-----------------------------+
+| cast('100' as decimal(9,0)) |
++-----------------------------+
+| 100                         |
++-----------------------------+
+[localhost:21000] > select cast('100' as decimal(3,0)); -- Small integer value fits within 3 digits of precision.
++-----------------------------+
+| cast('100' as decimal(3,0)) |
++-----------------------------+
+| 100                         |
++-----------------------------+
+[localhost:21000] > select cast('100' as decimal(2,0)); -- 2 digits of precision is not enough!
++-----------------------------+
+| cast('100' as decimal(2,0)) |
++-----------------------------+
+| NULL                        |
++-----------------------------+
+[localhost:21000] > select cast('100' as decimal(3,1)); -- (3,1) = 2 digits left of the decimal point, 1 to the right. Not enough.
++-----------------------------+
+| cast('100' as decimal(3,1)) |
++-----------------------------+
+| NULL                        |
++-----------------------------+
+[localhost:21000] > select cast('100' as decimal(4,1)); -- 4 digits total, 1 to the right of the decimal point.
++-----------------------------+
+| cast('100' as decimal(4,1)) |
++-----------------------------+
+| 100.0                       |
++-----------------------------+
+[localhost:21000] > select cast('98.6' as decimal(3,1)); -- (3,1) can hold a 3 digit number with 1 fractional digit.
++------------------------------+
+| cast('98.6' as decimal(3,1)) |
++------------------------------+
+| 98.6                         |
++------------------------------+
+[localhost:21000] > select cast('98.6' as decimal(15,1)); -- Larger precision allows bigger numbers but still only 1 fractional digit.
++-------------------------------+
+| cast('98.6' as decimal(15,1)) |
++-------------------------------+
+| 98.6                          |
++-------------------------------+
+[localhost:21000] > select cast('98.6' as decimal(15,5)); -- Larger scale allows more fractional digits, outputs trailing zeros.
++-------------------------------+
+| cast('98.6' as decimal(15,5)) |
++-------------------------------+
+| 98.60000                      |
++-------------------------------+
+[localhost:21000] > select cast('98.60000' as decimal(15,1)); -- Trailing zeros in the string must fit within 'scale' digits (1 in this case).
++-----------------------------------+
+| cast('98.60000' as decimal(15,1)) |
++-----------------------------------+
+| NULL                              |
++-----------------------------------+
+</codeblock>
+      </li>
+
+      <li>
+        Most built-in arithmetic functions such as <codeph>SIN()</codeph> and <codeph>COS()</codeph> continue to
+        accept only <codeph>DOUBLE</codeph> values because they are so commonly used in scientific contexts for
+        calculations of IEEE 754-compliant values.
The built-in functions that accept and return
+        <codeph>DECIMAL</codeph> are:
+<!-- List from Skye: positive, negative, least, greatest, fnv_hash, if, nullif, zeroifnull, isnull, coalesce -->
+<!-- Nong had already told me about abs, ceil, floor, round, truncate -->
+        <ul>
+          <li>
+            <codeph>ABS()</codeph>
+          </li>
+
+          <li>
+            <codeph>CEIL()</codeph>
+          </li>
+
+          <li>
+            <codeph>COALESCE()</codeph>
+          </li>
+
+          <li>
+            <codeph>FLOOR()</codeph>
+          </li>
+
+          <li>
+            <codeph>FNV_HASH()</codeph>
+          </li>
+
+          <li>
+            <codeph>GREATEST()</codeph>
+          </li>
+
+          <li>
+            <codeph>IF()</codeph>
+          </li>
+
+          <li>
+            <codeph>ISNULL()</codeph>
+          </li>
+
+          <li>
+            <codeph>LEAST()</codeph>
+          </li>
+
+          <li>
+            <codeph>NEGATIVE()</codeph>
+          </li>
+
+          <li>
+            <codeph>NULLIF()</codeph>
+          </li>
+
+          <li>
+            <codeph>POSITIVE()</codeph>
+          </li>
+
+          <li>
+            <codeph>PRECISION()</codeph>
+          </li>
+
+          <li>
+            <codeph>ROUND()</codeph>
+          </li>
+
+          <li>
+            <codeph>SCALE()</codeph>
+          </li>
+
+          <li>
+            <codeph>TRUNCATE()</codeph>
+          </li>
+
+          <li>
+            <codeph>ZEROIFNULL()</codeph>
+          </li>
+        </ul>
+        See <xref href="impala_functions.xml#builtins"/> for details.
+      </li>
+
+      <li>
+        <p>
+          <codeph>BIGINT</codeph>, <codeph>INT</codeph>, <codeph>SMALLINT</codeph>, and <codeph>TINYINT</codeph>
+          values can all be cast to <codeph>DECIMAL</codeph>. The number of digits to the left of the decimal point
+          in the <codeph>DECIMAL</codeph> type must be sufficient to hold the largest value of the corresponding
+          integer type. Note that integer literals are treated as the smallest appropriate integer type, meaning
+          there is sometimes a range of values that require one more digit of <codeph>DECIMAL</codeph> precision than
+          you might expect. For integer values, the scale of the <codeph>DECIMAL</codeph> type can be zero; if
+          the scale is greater than zero, remember to increase the precision by an equivalent amount to hold
+          the required number of digits to the left of the decimal point.
+        </p>
+        <p>
+          The following examples show how different integer types are converted to <codeph>DECIMAL</codeph>.
+        </p>
+<!-- According to Nong, it's a bug that so many integer digits can be converted to a DECIMAL
+     value with small (s,p) spec. So expect to re-do this example.
--> +<codeblock>[localhost:21000] > select cast(1 as decimal(1,0)); ++-------------------------+ +| cast(1 as decimal(1,0)) | ++-------------------------+ +| 1 | ++-------------------------+ +[localhost:21000] > select cast(9 as decimal(1,0)); ++-------------------------+ +| cast(9 as decimal(1,0)) | ++-------------------------+ +| 9 | ++-------------------------+ +[localhost:21000] > select cast(10 as decimal(1,0)); ++--------------------------+ +| cast(10 as decimal(1,0)) | ++--------------------------+ +| 10 | ++--------------------------+ +[localhost:21000] > select cast(10 as decimal(1,1)); ++--------------------------+ +| cast(10 as decimal(1,1)) | ++--------------------------+ +| 10.0 | ++--------------------------+ +[localhost:21000] > select cast(100 as decimal(1,1)); ++---------------------------+ +| cast(100 as decimal(1,1)) | ++---------------------------+ +| 100.0 | ++---------------------------+ +[localhost:21000] > select cast(1000 as decimal(1,1)); ++----------------------------+ +| cast(1000 as decimal(1,1)) | ++----------------------------+ +| 1000.0 | ++----------------------------+ +</codeblock> + </li> + + <li> + <p> + When a <codeph>DECIMAL</codeph> value is converted to any of the integer types, any fractional part is + truncated (that is, rounded towards zero): + </p> +<codeblock>[localhost:21000] > create table num_dec_days (x decimal(4,1)); +[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as decimal(4,1))); +[localhost:21000] > insert into num_dec_days values (cast(0.1 as decimal(4,1))), (cast(.9 as decimal(4,1))), (cast(9.1 as decimal(4,1))), (cast(9.9 as decimal(4,1))); +[localhost:21000] > select cast(x as int) from num_dec_days; ++----------------+ +| cast(x as int) | ++----------------+ +| 1 | +| 2 | +| 4 | +| 0 | +| 0 | +| 9 | +| 9 | ++----------------+ +</codeblock> + </li> + + <li> + <p> + You cannot directly cast <codeph>TIMESTAMP</codeph> or <codeph>BOOLEAN</codeph> values to or from + <codeph>DECIMAL</codeph> values. You can turn a <codeph>DECIMAL</codeph> value into a time-related + representation using a two-step process, by converting it to an integer value and then using that result + in a call to a date and time function such as <codeph>from_unixtime()</codeph>. + </p> +<codeblock>[localhost:21000] > select from_unixtime(cast(cast(1000.0 as decimal) as bigint)); ++-------------------------------------------------------------+ +| from_unixtime(cast(cast(1000.0 as decimal(9,0)) as bigint)) | ++-------------------------------------------------------------+ +| 1970-01-01 00:16:40 | ++-------------------------------------------------------------+ +[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days; -- x is a DECIMAL column. + +[localhost:21000] > create table num_dec_days (x decimal(4,1)); +[localhost:21000] > insert into num_dec_days values (1), (2), (cast(4.5 as decimal(4,1))); +[localhost:21000] > select now() + interval cast(x as int) days from num_dec_days; -- The 4.5 value is truncated to 4 and becomes '4 days'. 
++--------------------------------------+
+| now() + interval cast(x as int) days |
++--------------------------------------+
+| 2014-05-13 23:11:55.163284000        |
+| 2014-05-14 23:11:55.163284000        |
+| 2014-05-16 23:11:55.163284000        |
++--------------------------------------+
+</codeblock>
+      </li>
+
+      <li>
+        <p>
+          Because values in <codeph>INSERT</codeph> statements are checked rigorously for type compatibility, be
+          prepared to use <codeph>CAST()</codeph> function calls around literals, column references, or other
+          expressions that you are inserting into a <codeph>DECIMAL</codeph> column.
+        </p>
+      </li>
+    </ul>
+
+    <p conref="../shared/impala_common.xml#common/null_bad_numeric_cast"/>
+
+    <p>
+      <b>DECIMAL differences from integer and floating-point types:</b>
+    </p>
+
+    <p>
+      With the <codeph>DECIMAL</codeph> type, you are concerned with the number of overall digits of a number
+      rather than powers of 2 (as in <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>, and so on). Therefore,
+      the limits with integral values of <codeph>DECIMAL</codeph> types fall around 99, 999, 9999, and so on rather
+      than 32767, 65535, 2<sup>32</sup>-1, and so on. For fractional values, you do not need to account for imprecise representation of the
+      fractional part according to the IEEE-754 standard (as in <codeph>FLOAT</codeph> and
+      <codeph>DOUBLE</codeph>). Therefore, when you insert a fractional value into a <codeph>DECIMAL</codeph>
+      column, you can compare it, sum it, query it, <codeph>GROUP BY</codeph> it, and so on, and get back the
+      original values rather than some <q>close but not identical</q> value.
+    </p>
+
+    <p>
+      <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> can cause problems or unexpected behavior due to their
+      inability to precisely represent certain fractional values, for example dollar and cents values for currency. You might
+      find output values slightly different from the ones you inserted, equality tests that do not match precisely, or
+      unexpected values for <codeph>GROUP BY</codeph> columns. <codeph>DECIMAL</codeph> can help reduce unexpected
+      behavior and rounding errors, at the expense of some performance overhead for assignments and comparisons.
+    </p>
+
+    <p>
+      <b>Literals and expressions:</b>
+      <ul>
+        <li>
+          <p>
+            When you use an integer literal such as <codeph>1</codeph> or <codeph>999</codeph> in a SQL statement,
+            depending on the context, Impala treats it as either the smallest appropriate
+            <codeph>DECIMAL</codeph> type, or the smallest integer type (<codeph>TINYINT</codeph>,
+            <codeph>SMALLINT</codeph>, <codeph>INT</codeph>, or <codeph>BIGINT</codeph>). To minimize memory usage,
+            Impala prefers to treat the literal as the smallest appropriate integer type.
+          </p>
+        </li>
+
+        <li>
+          <p>
+            When you use a floating-point literal such as <codeph>1.1</codeph> or <codeph>999.44</codeph> in a SQL
+            statement, depending on the context, Impala treats it as either the smallest appropriate
+            <codeph>DECIMAL</codeph> type, or the smallest floating-point type (<codeph>FLOAT</codeph> or
+            <codeph>DOUBLE</codeph>). To avoid loss of accuracy, Impala prefers to treat the literal as a
+            <codeph>DECIMAL</codeph>.
+          </p>
+        </li>
+      </ul>
+    </p>
+
+    <p>
+      <b>Storage considerations:</b>
+    </p>
+
+    <ul>
+      <li>
+        Only the precision determines the storage size for <codeph>DECIMAL</codeph> values; the scale setting has
+        no effect on the storage size.
+      </li>
+
+      <li>
+        Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these text-based file formats,
+        leading zeros are not stored, but trailing zeros are stored. In these tables, each <codeph>DECIMAL</codeph>
+        value takes up as many bytes as there are digits in the value, plus an extra byte if the decimal point is
+        present and an extra byte for negative values. Once the values are loaded into memory, they are represented
+        in 4, 8, or 16 bytes as described in the following list items. The on-disk representation varies depending
+        on the file format of the table.
+      </li>
+
+<!-- Next couple of points can be conref'ed with identical list bullets farther down under File Format Considerations. -->
+
+      <li>
+        Parquet and Avro tables use binary formats. In these tables, Impala stores each value in as few bytes as
+        possible
+<!-- 4, 8, or 16 bytes -->
+        depending on the precision specified for the <codeph>DECIMAL</codeph> column.
+        <ul>
+          <li>
+            In memory, <codeph>DECIMAL</codeph> values with precision of 9 or less are stored in 4 bytes.
+          </li>
+
+          <li>
+            In memory, <codeph>DECIMAL</codeph> values with precision of 10 through 18 are stored in 8 bytes.
+          </li>
+
+          <li>
+            In memory, <codeph>DECIMAL</codeph> values with precision greater than 18 are stored in 16 bytes.
+          </li>
+        </ul>
+      </li>
+    </ul>
+
+    <p conref="../shared/impala_common.xml#common/file_format_blurb"/>
+
+    <ul>
+      <li>
+        The <codeph>DECIMAL</codeph> data type can be stored in any of the file formats supported by Impala, as
+        described in <xref href="impala_file_formats.xml#file_formats"/>. Impala only writes to tables that use the
+        Parquet and text formats, so those formats are the focus for file format compatibility.
+      </li>
+
+      <li>
+        On CDH 5 only, Impala can query Avro, RCFile, and SequenceFile tables containing <codeph>DECIMAL</codeph>
+        columns created by other Hadoop components.
+      </li>
+
+      <li>
+        You can use <codeph>DECIMAL</codeph> columns in Impala tables that are mapped to HBase tables. Impala can
+        query and insert into such tables.
+      </li>
+
+      <li>
+        Text, RCFile, and SequenceFile tables all use ASCII-based formats. In these tables, each
+        <codeph>DECIMAL</codeph> value takes up as many bytes as there are digits in the value, plus an extra byte
+        if the decimal point is present. The binary format of Parquet or Avro files offers more compact storage for
+        <codeph>DECIMAL</codeph> columns.
+      </li>
+
+      <li>
+        Parquet and Avro tables use binary formats. In these tables, Impala stores each value in 4, 8, or 16 bytes
+        depending on the precision specified for the <codeph>DECIMAL</codeph> column.
+      </li>
+
+      <li>
+        Parquet files containing <codeph>DECIMAL</codeph> columns are not expected to be readable under CDH 4. See
+        the <b>Compatibility</b> section for details.
+      </li>
+    </ul>
+
+    <p>
+      <b>UDF considerations:</b> When writing a C++ UDF, use the <codeph>DecimalVal</codeph> data type defined in
+      <filepath>/usr/include/impala_udf/udf.h</filepath>.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
+
+    <p>
+      You can use a <codeph>DECIMAL</codeph> column as a partition key. Doing so provides a better match between
+      the partition key values and the HDFS directory names than using a <codeph>DOUBLE</codeph> or
+      <codeph>FLOAT</codeph> partitioning column.
+    </p>
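+
+    <p>
+      The following minimal sketch illustrates the idea with a hypothetical
+      <codeph>SALES</codeph> table; the table and column names are invented for this example:
+    </p>
+
+<codeblock>-- Each distinct value of the DECIMAL partition key maps to a cleanly
+-- named HDFS directory such as .../price=9.99, with no floating-point
+-- rounding in the directory name.
+create table sales (id bigint) partitioned by (price decimal(5,2));
+insert into sales partition (price=9.99) values (1);
+</codeblock>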
+
+    <p conref="../shared/impala_common.xml#common/schema_evolution_blurb"/>
+
+    <ul>
+      <li>
+        For text-based formats (text, RCFile, and SequenceFile tables), you can issue an <codeph>ALTER TABLE ...
+ REPLACE COLUMNS</codeph> statement to change the precision and scale of an existing + <codeph>DECIMAL</codeph> column. As long as the values in the column fit within the new precision and + scale, they are returned correctly by a query. Any values that do not fit within the new precision and + scale are returned as <codeph>NULL</codeph>, and Impala reports the conversion error. Leading zeros do not + count against the precision value, but trailing zeros after the decimal point do. +<codeblock>[localhost:21000] > create table text_decimals (x string); +[localhost:21000] > insert into text_decimals values ("1"), ("2"), ("99.99"), ("1.234"), ("000001"), ("1.000000000"); +[localhost:21000] > select * from text_decimals; ++-------------+ +| x | ++-------------+ +| 1 | +| 2 | +| 99.99 | +| 1.234 | +| 000001 | +| 1.000000000 | ++-------------+ +[localhost:21000] > alter table text_decimals replace columns (x decimal(4,2)); +[localhost:21000] > select * from text_decimals; ++-------+ +| x | ++-------+ +| 1.00 | +| 2.00 | +| 99.99 | +| NULL | +| 1.00 | +| NULL | ++-------+ +ERRORS: +Backend 0:Error converting column: 0 TO DECIMAL(4, 2) (Data is: 1.234) +file: hdfs://127.0.0.1:8020/user/hive/warehouse/decimal_testing.db/text_decimals/634d4bd3aa0 +e8420-b4b13bab7f1be787_56794587_data.0 +record: 1.234 +Error converting column: 0 TO DECIMAL(4, 2) (Data is: 1.000000000) +file: hdfs://127.0.0.1:8020/user/hive/warehouse/decimal_testing.db/text_decimals/cd40dc68e20 +c565a-cc4bd86c724c96ba_311873428_data.0 +record: 1.000000000 +</codeblock> + </li> + + <li> + For binary formats (Parquet and Avro tables), although an <codeph>ALTER TABLE ... REPLACE COLUMNS</codeph> + statement that changes the precision or scale of a <codeph>DECIMAL</codeph> column succeeds, any subsequent + attempt to query the changed column results in a fatal error. (The other columns can still be queried + successfully.) This is because the metadata about the columns is stored in the data files themselves, and + <codeph>ALTER TABLE</codeph> does not actually make any updates to the data files. If the metadata in the + data files disagrees with the metadata in the metastore database, Impala cancels the query. 
+ </li> + </ul> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>CREATE TABLE t1 (x DECIMAL, y DECIMAL(5,2), z DECIMAL(25,0)); +INSERT INTO t1 VALUES (5, 99.44, 123456), (300, 6.7, 999999999); +SELECT x+y, ROUND(y,1), z/98.6 FROM t1; +SELECT CAST(1000.5 AS DECIMAL); +</codeblock> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/decimal_no_stats"/> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/partitioning_good"/> --> + + <p conref="../shared/impala_common.xml#common/hbase_ok"/> + + <p conref="../shared/impala_common.xml#common/parquet_ok"/> + + <p conref="../shared/impala_common.xml#common/text_bulky"/> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/compatibility_blurb"/> --> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/internals_blurb"/> --> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/added_in_20"/> --> + + <p conref="../shared/impala_common.xml#common/column_stats_constant"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_literals.xml#numeric_literals"/>, <xref href="impala_tinyint.xml#tinyint"/>, + <xref href="impala_smallint.xml#smallint"/>, <xref href="impala_int.xml#int"/>, + <xref href="impala_bigint.xml#bigint"/>, <xref href="impala_decimal.xml#decimal"/>, + <xref href="impala_math_functions.xml#math_functions"/> (especially <codeph>PRECISION()</codeph> and + <codeph>SCALE()</codeph>) + </p> + </conbody> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_default_order_by_limit.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_default_order_by_limit.xml b/docs/topics/impala_default_order_by_limit.xml
new file mode 100644
index 0000000..def0335
--- /dev/null
+++ b/docs/topics/impala_default_order_by_limit.xml
@@ -0,0 +1,34 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="obwl" id="default_order_by_limit">
+
+  <title>DEFAULT_ORDER_BY_LIMIT Query Option</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p conref="../shared/impala_common.xml#common/obwl_query_options"/>
+
+    <p rev="1.4.0">
+      Prior to Impala 1.4.0, Impala queries that used the <codeph><xref href="impala_order_by.xml#order_by">ORDER
+      BY</xref></codeph> clause had to also include a
+      <codeph><xref href="impala_limit.xml#limit">LIMIT</xref></codeph> clause, to avoid accidentally producing
+      huge result sets that must be sorted. Sorting a huge result set is a memory-intensive operation. In Impala
+      1.4.0 and higher, Impala uses a temporary disk work area to perform the sort if that operation would
+      otherwise exceed the Impala memory limit on a particular host.
+    </p>
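+
+    <p>
+      A minimal sketch of how the option was typically used in <cmdname>impala-shell</cmdname>
+      before Impala 1.4.0; the table and column names here are hypothetical:
+    </p>
+
+<codeblock>-- Supply an implicit LIMIT for ORDER BY queries that do not specify one.
+set default_order_by_limit=1000;
+-- Treated as if the query ended with LIMIT 1000.
+select c1 from t1 order by c1;
+</codeblock>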
+
+    <p>
+      <b>Type:</b> numeric
+    </p>
+
+    <p>
+      <b>Default:</b> -1 (no default limit)
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_delete.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_delete.xml b/docs/topics/impala_delete.xml
new file mode 100644
index 0000000..fcac5e4
--- /dev/null
+++ b/docs/topics/impala_delete.xml
@@ -0,0 +1,64 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="delete">
+
+  <title>DELETE Statement (CDH 5.5 and higher only)</title>
+  <titlealts><navtitle>DELETE</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Kudu"/>
+      <data name="Category" value="ETL"/>
+      <data name="Category" value="Ingest"/>
+      <data name="Category" value="DML"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">DELETE statement</indexterm>
+      Deletes one or more rows from a Kudu table.
+      Although deleting a single row or a range of rows would be inefficient for tables using HDFS
+      data files, Kudu is able to perform this operation efficiently. Therefore, this statement
+      only works for Impala tables that use the Kudu storage engine.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>
+</codeblock>
+
+    <p rev="kudu" audience="impala_next">
+      Normally, a <codeph>DELETE</codeph> operation for a Kudu table fails if
+      some partition key columns are not found, because they were deleted or changed
+      by a concurrent <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> operation.
+      Specify <codeph>DELETE IGNORE <varname>rest_of_statement</varname></codeph> to
+      make the <codeph>DELETE</codeph> continue in this case. The rows whose partition
+      key values cannot be found are not removed.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/dml_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
+
+    <note conref="../shared/impala_common.xml#common/compute_stats_next"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+<codeblock>
+
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_kudu.xml#impala_kudu"/>
+    </p>
+
+  </conbody>
+
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_describe.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_describe.xml b/docs/topics/impala_describe.xml
new file mode 100644
index 0000000..ffdb505
--- /dev/null
+++ b/docs/topics/impala_describe.xml
@@ -0,0 +1,561 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="describe">
+
+  <title id="desc">DESCRIBE Statement</title>
+  <titlealts><navtitle>DESCRIBE</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Data Types"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Tables"/>
+      <data name="Category" value="Reports"/>
+      <data name="Category" value="Schemas"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">DESCRIBE statement</indexterm>
+      The <codeph>DESCRIBE</codeph> statement displays metadata about a table, such as the column names and their
+      data types. Its syntax is:
+    </p>
+
+<codeblock rev="2.3.0">DESCRIBE [FORMATTED] [<varname>db_name</varname>.]<varname>table_name</varname>[.<varname>complex_col_name</varname> ...]</codeblock>
+
+    <p>
+      You can use the abbreviation <codeph>DESC</codeph> for the <codeph>DESCRIBE</codeph> statement.
+    </p>
+
+    <p rev="1.1">
+      The <codeph>DESCRIBE FORMATTED</codeph> variation displays additional information, in a format familiar to
+      users of Apache Hive. The extra information includes low-level details such as whether the table is internal
+      or external, when it was created, the file format, the location of the data in HDFS, whether the object is a
+      table or a view, and (for views) the text of the query from the view definition.
+    </p>
+
+    <note>
+      The <codeph>Compressed</codeph> field is not a reliable indicator of whether the table contains compressed
+      data. It typically shows <codeph>No</codeph>, because the compression settings only apply during the
+      session that loads data and are not stored persistently with the table metadata.
+    </note>
+
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+    <p rev="2.3.0">
+      For the <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph> types available in
+      CDH 5.5 / Impala 2.3 and higher, the <codeph>DESCRIBE</codeph> output is formatted to avoid
+      excessively long lines for multiple fields within a <codeph>STRUCT</codeph>, or a nested sequence of
+      complex types.
+ </p> + + <p conref="../shared/impala_common.xml#common/complex_types_describe"/> + + <p rev="2.3.0"> + For example, here is the <codeph>DESCRIBE</codeph> output for a table containing a single top-level column + of each complex type: + </p> + +<codeblock rev="2.3.0"><![CDATA[create table t1 (x int, a array<int>, s struct<f1: string, f2: bigint>, m map<string,int>) stored as parquet; + +describe t1; ++------+-----------------+---------+ +| name | type | comment | ++------+-----------------+---------+ +| x | int | | +| a | array<int> | | +| s | struct< | | +| | f1:string, | | +| | f2:bigint | | +| | > | | +| m | map<string,int> | | ++------+-----------------+---------+ +]]> +</codeblock> + + <p rev="2.3.0"> + Here are examples showing how to <q>drill down</q> into the layouts of complex types, including + using multi-part names to examine the definitions of nested types. + The <codeph>< ></codeph> delimiters identify the columns with complex types; + these are the columns where you can descend another level to see the parts that make up + the complex type. + This technique helps you to understand the multi-part names you use as table references in queries + involving complex types, and the corresponding column names you refer to in the <codeph>SELECT</codeph> list. + These tables are from the <q>nested TPC-H</q> schema, shown in detail in + <xref href="impala_complex_types.xml#complex_sample_schema"/>. + </p> + + <p> + The <codeph>REGION</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph> + elements: + </p> + + <ul> + <li> + <p> + The first <codeph>DESCRIBE</codeph> specifies the table name, to display the definition + of each top-level column. + </p> + </li> + <li> + <p> + The second <codeph>DESCRIBE</codeph> specifies the name of a complex + column, <codeph>REGION.R_NATIONS</codeph>, showing that when you include the name of an <codeph>ARRAY</codeph> + column in a <codeph>FROM</codeph> clause, that table reference acts like a two-column table with + columns <codeph>ITEM</codeph> and <codeph>POS</codeph>. + </p> + </li> + <li> + <p> + The final <codeph>DESCRIBE</codeph> specifies the fully qualified name of the <codeph>ITEM</codeph> field, + to display the layout of its underlying <codeph>STRUCT</codeph> type in table format, with the fields + mapped to column names. + </p> + </li> + </ul> + +<codeblock rev="2.3.0"><![CDATA[ +-- #1: The overall layout of the entire table. +describe region; ++-------------+-------------------------+---------+ +| name | type | comment | ++-------------+-------------------------+---------+ +| r_regionkey | smallint | | +| r_name | string | | +| r_comment | string | | +| r_nations | array<struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | >> | | ++-------------+-------------------------+---------+ + +-- #2: The ARRAY column within the table. +describe region.r_nations; ++------+-------------------------+---------+ +| name | type | comment | ++------+-------------------------+---------+ +| item | struct< | | +| | n_nationkey:smallint, | | +| | n_name:string, | | +| | n_comment:string | | +| | > | | +| pos | bigint | | ++------+-------------------------+---------+ + +-- #3: The STRUCT that makes up each ARRAY element. +-- The fields of the STRUCT act like columns of a table. 
+describe region.r_nations.item;
++-------------+----------+---------+
+| name        | type     | comment |
++-------------+----------+---------+
+| n_nationkey | smallint |         |
+| n_name      | string   |         |
+| n_comment   | string   |         |
++-------------+----------+---------+
+]]>
+</codeblock>
+
+    <p>
+      The <codeph>CUSTOMER</codeph> table contains an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>
+      elements, where one field in the <codeph>STRUCT</codeph> is another <codeph>ARRAY</codeph> of
+      <codeph>STRUCT</codeph> elements:
+    </p>
+    <ul>
+      <li>
+        <p>
+          Again, the initial <codeph>DESCRIBE</codeph> specifies only the table name.
+        </p>
+      </li>
+      <li>
+        <p>
+          The second <codeph>DESCRIBE</codeph> specifies the qualified name of the complex
+          column, <codeph>CUSTOMER.C_ORDERS</codeph>, showing how an <codeph>ARRAY</codeph>
+          is represented as a two-column table with columns <codeph>ITEM</codeph> and <codeph>POS</codeph>.
+        </p>
+      </li>
+      <li>
+        <p>
+          The third <codeph>DESCRIBE</codeph> specifies the qualified name of the <codeph>ITEM</codeph>
+          of the <codeph>ARRAY</codeph> column, to see the structure of the nested <codeph>ARRAY</codeph>.
+          Again, it has two parts, <codeph>ITEM</codeph> and <codeph>POS</codeph>. Because the
+          <codeph>ARRAY</codeph> contains a <codeph>STRUCT</codeph>, the layout of the <codeph>STRUCT</codeph>
+          is shown.
+        </p>
+      </li>
+      <li>
+        <p>
+          The fourth and fifth <codeph>DESCRIBE</codeph> statements drill down into a <codeph>STRUCT</codeph> field that
+          is itself a complex type, an <codeph>ARRAY</codeph> of <codeph>STRUCT</codeph>.
+          The <codeph>ITEM</codeph> portion of the qualified name is only required when the <codeph>ARRAY</codeph>
+          elements are anonymous. The fields of the <codeph>STRUCT</codeph> give names to any other complex types
+          nested inside the <codeph>STRUCT</codeph>. Therefore, the <codeph>DESCRIBE</codeph> parameters
+          <codeph>CUSTOMER.C_ORDERS.ITEM.O_LINEITEMS</codeph> and <codeph>CUSTOMER.C_ORDERS.O_LINEITEMS</codeph>
+          are equivalent. (For brevity, Cloudera recommends leaving out the <codeph>ITEM</codeph> portion of
+          a qualified name when it is not required.)
+        </p>
+      </li>
+      <li>
+        <p>
+          The final <codeph>DESCRIBE</codeph> shows the layout of the deeply nested <codeph>STRUCT</codeph> type.
+          Because there are no more complex types nested inside this <codeph>STRUCT</codeph>, this is as far
+          as you can drill down into the layout for this table.
+        </p>
+      </li>
+    </ul>
+
+<codeblock rev="2.3.0"><![CDATA[-- #1: The overall layout of the entire table.
+describe customer;
++--------------+------------------------------------+
+| name         | type                               |
++--------------+------------------------------------+
+| c_custkey    | bigint                             |
+... more scalar columns ...
+| c_orders     | array<struct<                      |
+|              |   o_orderkey:bigint,               |
+|              |   o_orderstatus:string,            |
+|              |   o_totalprice:decimal(12,2),      |
+|              |   o_orderdate:string,              |
+|              |   o_orderpriority:string,          |
+|              |   o_clerk:string,                  |
+|              |   o_shippriority:int,              |
+|              |   o_comment:string,                |
+|              |   o_lineitems:array<struct<        |
+|              |     l_partkey:bigint,              |
+|              |     l_suppkey:bigint,              |
+|              |     l_linenumber:int,              |
+|              |     l_quantity:decimal(12,2),      |
+|              |     l_extendedprice:decimal(12,2), |
+|              |     l_discount:decimal(12,2),      |
+|              |     l_tax:decimal(12,2),           |
+|              |     l_returnflag:string,           |
+|              |     l_linestatus:string,           |
+|              |     l_shipdate:string,             |
+|              |     l_commitdate:string,           |
+|              |     l_receiptdate:string,          |
+|              |     l_shipinstruct:string,         |
+|              |     l_shipmode:string,             |
+|              |     l_comment:string               |
+|              |   >>                               |
+|              | >>                                 |
++--------------+------------------------------------+
+
+-- #2: The ARRAY column within the table.
+describe customer.c_orders; ++------+------------------------------------+ +| name | type | ++------+------------------------------------+ +| item | struct< | +| | o_orderkey:bigint, | +| | o_orderstatus:string, | +... more struct fields ... +| | o_lineitems:array<struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more nested struct fields ... +| | l_comment:string | +| | >> | +| | > | +| pos | bigint | ++------+------------------------------------+ + +-- #3: The STRUCT that makes up each ARRAY element. +-- The fields of the STRUCT act like columns of a table. +describe customer.c_orders.item; ++-----------------+----------------------------------+ +| name | type | ++-----------------+----------------------------------+ +| o_orderkey | bigint | +| o_orderstatus | string | +| o_totalprice | decimal(12,2) | +| o_orderdate | string | +| o_orderpriority | string | +| o_clerk | string | +| o_shippriority | int | +| o_comment | string | +| o_lineitems | array<struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more struct fields ... +| | l_comment:string | +| | >> | ++-----------------+----------------------------------+ + +-- #4: The ARRAY nested inside the STRUCT elements of the first ARRAY. +describe customer.c_orders.item.o_lineitems; ++------+----------------------------------+ +| name | type | ++------+----------------------------------+ +| item | struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more struct fields ... +| | l_comment:string | +| | > | +| pos | bigint | ++------+----------------------------------+ + +-- #5: Shorter form of the previous DESCRIBE. Omits the .ITEM portion of the name +-- because O_LINEITEMS and other field names provide a way to refer to things +-- inside the ARRAY element. +describe customer.c_orders.o_lineitems; ++------+----------------------------------+ +| name | type | ++------+----------------------------------+ +| item | struct< | +| | l_partkey:bigint, | +| | l_suppkey:bigint, | +... more struct fields ... +| | l_comment:string | +| | > | +| pos | bigint | ++------+----------------------------------+ + +-- #6: The STRUCT representing ARRAY elements nested inside +-- another ARRAY of STRUCTs. The lack of any complex types +-- in this output means this is as far as DESCRIBE can +-- descend into the table layout. +describe customer.c_orders.o_lineitems.item; ++-----------------+---------------+ +| name | type | ++-----------------+---------------+ +| l_partkey | bigint | +| l_suppkey | bigint | +... more scalar columns ... +| l_comment | string | ++-----------------+---------------+ +]]> +</codeblock> + +<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + +<p> + After the <cmdname>impalad</cmdname> daemons are restarted, the first query against a table can take longer + than subsequent queries, because the metadata for the table is loaded before the query is processed. This + one-time delay for each table can cause misleading results in benchmark tests or cause unnecessary concern. + To <q>warm up</q> the Impala metadata cache, you can issue a <codeph>DESCRIBE</codeph> statement in advance + for each table you intend to access later. +</p> + +<p> + When you are dealing with data files stored in HDFS, sometimes it is important to know details such as the + path of the data files for an Impala table, and the host name for the namenode. You can get this information + from the <codeph>DESCRIBE FORMATTED</codeph> output. 
You specify HDFS URIs or path specifications with + statements such as <codeph>LOAD DATA</codeph> and the <codeph>LOCATION</codeph> clause of <codeph>CREATE + TABLE</codeph> or <codeph>ALTER TABLE</codeph>. You might also use HDFS URIs or paths with Linux commands + such as <cmdname>hadoop</cmdname> and <cmdname>hdfs</cmdname> to copy, rename, and so on, data files in HDFS. +</p> + +<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> + +<p rev="1.2.1"> + Each table can also have associated table statistics and column statistics. To see these categories of + information, use the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and <codeph>SHOW COLUMN + STATS <varname>table_name</varname></codeph> statements. +<!-- +For example, the table statistics can often show you the number +and total size of the files in the table, even if you have not +run <codeph>COMPUTE STATS</codeph>. +--> + See <xref href="impala_show.xml#show"/> for details. +</p> + +<p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + +<p rev="2.3.0"> + Because the column definitions for complex types can become long, particularly when such types are nested, + the <codeph>DESCRIBE</codeph> statement uses special formatting for complex type columns to make the output readable. +</p> + +<note conref="../shared/impala_common.xml#common/compute_stats_next"/> + +<p conref="../shared/impala_common.xml#common/example_blurb"/> + +<p> + The following example shows the results of both a standard <codeph>DESCRIBE</codeph> and <codeph>DESCRIBE + FORMATTED</codeph> for different kinds of schema objects: +</p> + +<ul> + <li> + <codeph>DESCRIBE</codeph> for a table or a view returns the name, type, and comment for each of the + columns. For a view, if the column value is computed by an expression, the column name is automatically + generated as <codeph>_c0</codeph>, <codeph>_c1</codeph>, and so on depending on the ordinal number of the + column. + </li> + + <li> + A table created with no special format or storage clauses is designated as a <codeph>MANAGED_TABLE</codeph> + (an <q>internal table</q> in Impala terminology). Its data files are stored in an HDFS directory under the + default Hive data directory. By default, it uses Text data format. + </li> + + <li> + A view is designated as <codeph>VIRTUAL_VIEW</codeph> in <codeph>DESCRIBE FORMATTED</codeph> output. Some + of its properties are <codeph>NULL</codeph> or blank because they are inherited from the base table. The + text of the query that defines the view is part of the <codeph>DESCRIBE FORMATTED</codeph> output. + </li> + + <li> + A table with additional clauses in the <codeph>CREATE TABLE</codeph> statement has differences in + <codeph>DESCRIBE FORMATTED</codeph> output. The output for <codeph>T2</codeph> includes the + <codeph>EXTERNAL_TABLE</codeph> keyword because of the <codeph>CREATE EXTERNAL TABLE</codeph> syntax, and + different <codeph>InputFormat</codeph> and <codeph>OutputFormat</codeph> fields to reflect the Parquet file + format. + </li> + </ul> + +<codeblock>[localhost:21000] > create table t1 (x int, y int, s string); +Query: create table t1 (x int, y int, s string) +[localhost:21000] > describe t1; +Query: describe t1 +Query finished, fetching results ... 
++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | int | | +| y | int | | +| s | string | | ++------+--------+---------+ +Returned 3 row(s) in 0.13s +[localhost:21000] > describe formatted t1; +Query: describe formatted t1 +Query finished, fetching results ... ++------------------------------+--------------------------------------------------------------------+----------------------+ +| name | type | comment | ++------------------------------+--------------------------------------------------------------------+----------------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | None | +| y | int | None | +| s | string | None | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | describe_formatted | NULL | +| Owner: | cloudera | NULL | +| CreateTime: | Mon Jul 22 17:03:16 EDT 2013 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Location: | hdfs://127.0.0.1:8020/user/hive/warehouse/describe_formatted.db/t1 | NULL | +| Table Type: | MANAGED_TABLE | NULL | +| Table Parameters: | NULL | NULL | +| | transient_lastDdlTime | 1374526996 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL | +| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL | +| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | ++------------------------------+--------------------------------------------------------------------+----------------------+ +Returned 26 row(s) in 0.03s +[localhost:21000] > create view v1 as select x, upper(s) from t1; +Query: create view v1 as select x, upper(s) from t1 +[localhost:21000] > describe v1; +Query: describe v1 +Query finished, fetching results ... ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | int | | +| _c1 | string | | ++------+--------+---------+ +Returned 2 row(s) in 0.10s +[localhost:21000] > describe formatted v1; +Query: describe formatted v1 +Query finished, fetching results ... 
++------------------------------+------------------------------+----------------------+ +| name | type | comment | ++------------------------------+------------------------------+----------------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | None | +| _c1 | string | None | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | describe_formatted | NULL | +| Owner: | cloudera | NULL | +| CreateTime: | Mon Jul 22 16:56:38 EDT 2013 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Table Type: | VIRTUAL_VIEW | NULL | +| Table Parameters: | NULL | NULL | +| | transient_lastDdlTime | 1374526598 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | null | NULL | +| InputFormat: | null | NULL | +| OutputFormat: | null | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | +| | NULL | NULL | +| # View Information | NULL | NULL | +| View Original Text: | SELECT x, upper(s) FROM t1 | NULL | +| View Expanded Text: | SELECT x, upper(s) FROM t1 | NULL | ++------------------------------+------------------------------+----------------------+ +Returned 28 row(s) in 0.03s +[localhost:21000] > create external table t2 (x int, y int, s string) stored as parquet location '/user/cloudera/sample_data'; +[localhost:21000] > describe formatted t2; +Query: describe formatted t2 +Query finished, fetching results ... ++------------------------------+----------------------------------------------------+----------------------+ +| name | type | comment | ++------------------------------+----------------------------------------------------+----------------------+ +| # col_name | data_type | comment | +| | NULL | NULL | +| x | int | None | +| y | int | None | +| s | string | None | +| | NULL | NULL | +| # Detailed Table Information | NULL | NULL | +| Database: | describe_formatted | NULL | +| Owner: | cloudera | NULL | +| CreateTime: | Mon Jul 22 17:01:47 EDT 2013 | NULL | +| LastAccessTime: | UNKNOWN | NULL | +| Protect Mode: | None | NULL | +| Retention: | 0 | NULL | +| Location: | hdfs://127.0.0.1:8020/user/cloudera/sample_data | NULL | +| Table Type: | EXTERNAL_TABLE | NULL | +| Table Parameters: | NULL | NULL | +| | EXTERNAL | TRUE | +| | transient_lastDdlTime | 1374526907 | +| | NULL | NULL | +| # Storage Information | NULL | NULL | +| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL | +| InputFormat: | com.cloudera.impala.hive.serde.ParquetInputFormat | NULL | +| OutputFormat: | com.cloudera.impala.hive.serde.ParquetOutputFormat | NULL | +| Compressed: | No | NULL | +| Num Buckets: | 0 | NULL | +| Bucket Columns: | [] | NULL | +| Sort Columns: | [] | NULL | ++------------------------------+----------------------------------------------------+----------------------+ +Returned 27 row(s) in 0.17s</codeblock> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have read and execute + permissions for all directories that are part of the table. + (A table could span multiple different HDFS directories if it is partitioned. 
+      The directories could be widely scattered because a partition can reside
+      in an arbitrary HDFS directory based on its <codeph>LOCATION</codeph> attribute.)
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      <xref href="impala_tables.xml#tables"/>, <xref href="impala_create_table.xml#create_table"/>,
+      <xref href="impala_show.xml#show_tables"/>, <xref href="impala_show.xml#show_create_table"/>
+    </p>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_disable_codegen.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_disable_codegen.xml b/docs/topics/impala_disable_codegen.xml
new file mode 100644
index 0000000..844d49d
--- /dev/null
+++ b/docs/topics/impala_disable_codegen.xml
@@ -0,0 +1,36 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="disable_codegen">
+
+  <title>DISABLE_CODEGEN Query Option</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Troubleshooting"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">DISABLE_CODEGEN query option</indexterm>
+      This is a debug option, intended for diagnosing and working around issues that cause crashes. If a query
+      fails with an <q>illegal instruction</q> or other hardware-specific message, try setting
+      <codeph>DISABLE_CODEGEN=true</codeph> and running the query again. If the query succeeds only when the
+      <codeph>DISABLE_CODEGEN</codeph> option is turned on, submit the problem to Cloudera support and include that
+      detail in the problem report. Do not otherwise run with this setting turned on, because it results in lower
+      overall performance.
+    </p>
+
+    <p>
+      As an exception, because the code generation phase adds a small amount of overhead for each query, you might
+      turn on the <codeph>DISABLE_CODEGEN</codeph> option to achieve maximum throughput when running many
+      short-lived queries against small tables.
+    </p>
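+
+    <p>
+      For example, you might toggle the option in <cmdname>impala-shell</cmdname> while reproducing a problem
+      query. A minimal sketch; the table name here is hypothetical:
+    </p>
+
+<codeblock>set disable_codegen=true;
+-- Re-run the query that previously crashed. If it now succeeds, include
+-- the fact that DISABLE_CODEGEN made the difference in the problem report.
+select count(*) from suspect_table;
+-- Turn code generation back on for normal operation.
+set disable_codegen=false;</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/type_boolean"/>
+    <p conref="../shared/impala_common.xml#common/default_false_0"/>
+
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_disable_unsafe_spills.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_disable_unsafe_spills.xml b/docs/topics/impala_disable_unsafe_spills.xml
new file mode 100644
index 0000000..f251d65
--- /dev/null
+++ b/docs/topics/impala_disable_unsafe_spills.xml
@@ -0,0 +1,48 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.0.0" id="disable_unsafe_spills">
+
+  <title>DISABLE_UNSAFE_SPILLS Query Option</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Scalability"/>
+      <data name="Category" value="Memory"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">DISABLE_UNSAFE_SPILLS query option</indexterm>
+      Enable this option if you prefer to have queries fail when they exceed the Impala memory limit, rather than
+      write temporary data to disk.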
+    </p>
+
+    <p>
+      Queries that <q>spill</q> to disk typically complete successfully, where in earlier Impala releases they
+      would have failed. However, queries with exorbitant memory requirements (due to missing statistics or
+      inefficient join clauses) could become so slow as a result that you would rather have them cancelled
+      automatically, and instead reduce the memory usage through standard Impala tuning techniques.
+    </p>
+
+    <p>
+      This option prevents only <q>unsafe</q> spill operations, that is, spills by queries where one or more
+      tables are missing statistics or where the query does not include a hint to set the most efficient mechanism
+      for a join or <codeph>INSERT ... SELECT</codeph> into a partitioned table. Such queries are the most likely
+      to produce suboptimal execution plans that could cause unnecessary spilling. Therefore, leaving this option
+      enabled is a good way to find tables on which to run the <codeph>COMPUTE STATS</codeph> statement.
+    </p>
+
+    <p>
+      See <xref href="impala_scalability.xml#spill_to_disk"/> for information about the <q>spill to disk</q>
+      feature for queries processing large result sets with joins, <codeph>ORDER BY</codeph>, <codeph>GROUP
+      BY</codeph>, <codeph>DISTINCT</codeph>, aggregation functions, or analytic functions.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/type_boolean"/>
+    <p conref="../shared/impala_common.xml#common/default_false_0"/>
+
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_distinct.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_distinct.xml b/docs/topics/impala_distinct.xml
new file mode 100644
index 0000000..d49e400
--- /dev/null
+++ b/docs/topics/impala_distinct.xml
@@ -0,0 +1,59 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="distinct">
+
+  <title>DISTINCT Operator</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Querying"/>
+      <data name="Category" value="Aggregate Functions"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">DISTINCT operator</indexterm>
+      The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
+      remove duplicates:
+    </p>
+
+<codeblock>-- Returns the unique values from one column.
+-- NULL is included in the set of values if any rows have a NULL in this column.
+select distinct c_birth_country from customer;
+-- Returns the unique combinations of values from multiple columns.
+select distinct c_salutation, c_last_name from customer;</codeblock>
+
+    <p>
+      You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
+      <codeph>COUNT()</codeph>, to find how many different values a column contains:
+    </p>
+
+<codeblock>-- Counts the unique values from one column.
+-- NULL is not included as a distinct value in the count.
+select count(distinct c_birth_country) from customer;
+-- Counts the unique combinations of values from multiple columns.
+select count(distinct c_salutation, c_last_name) from customer;</codeblock>
+
+    <p>
+      One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
+      aggregation function in the same query.
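+      For example, you could not have a single query with both
+      <codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
+      <codeph>SELECT</codeph> list. As a workaround, you can estimate each count with the <codeph>NDV()</codeph>
+      function, which is not subject to this restriction, or compute each exact distinct count in its own subquery
+      and join the single-row results. The following sketch shows both techniques, reusing the
+      <codeph>customer</codeph> table from the earlier examples:
+    </p>
+
+<codeblock>-- Estimated distinct counts; NDV() can appear any number of times in one query.
+select ndv(c_first_name), ndv(c_last_name) from customer;
+-- Exact distinct counts; each subquery produces a single row,
+-- so joining the results is inexpensive.
+select f.cnt, l.cnt from
+  (select count(distinct c_first_name) as cnt from customer) f
+  cross join
+  (select count(distinct c_last_name) as cnt from customer) l;</codeblock>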
+    <p conref="../shared/impala_common.xml#common/zero_length_strings"/>
+
+    <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
+
+    <note>
+      <p>
+        In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
+        Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
+        BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
+      </p>
+    </note>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_dml.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_dml.xml b/docs/topics/impala_dml.xml
new file mode 100644
index 0000000..66d4022
--- /dev/null
+++ b/docs/topics/impala_dml.xml
@@ -0,0 +1,85 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="dml">
+
+  <title>DML Statements</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="DML"/>
+      <data name="Category" value="Data Analysts"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Tables"/>
+      <data name="Category" value="ETL"/>
+      <data name="Category" value="Ingest"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      DML refers to <q>Data Manipulation Language</q>, a subset of SQL statements that modify the data stored in
+      tables. Because Impala focuses on query performance and leverages the append-only nature of HDFS storage,
+      it currently supports only a small set of DML statements:
+    </p>
+
+    <ul>
+      <li audience="impala_next">
+        <xref href="impala_delete.xml#delete"/>; works for Kudu tables only
+      </li>
+
+      <li>
+        <xref href="impala_insert.xml#insert"/>
+      </li>
+
+      <li>
+        <xref href="impala_load_data.xml#load_data"/>
+      </li>
+
+      <li audience="impala_next">
+        <xref href="impala_update.xml#update"/>; works for Kudu tables only
+      </li>
+    </ul>
+
+    <p>
+      <codeph>INSERT</codeph> in Impala is primarily optimized for inserting large volumes of data in a single
+      statement, to make effective use of the multi-megabyte HDFS blocks. This is the primary way in Impala to
+      create new data files. If you intend to insert one or a few rows at a time, such as with the <codeph>INSERT ...
+      VALUES</codeph> syntax, that technique is much more efficient for Impala tables stored in HBase than for
+      HDFS-backed tables. See <xref href="impala_hbase.xml#impala_hbase"/> for details.
+    </p>
+
+    <p>
+      <codeph>LOAD DATA</codeph> moves existing data files into the directory for an Impala table, making them
+      immediately available for Impala queries. This is one way in Impala to work with data files produced by other
+      Hadoop components. (<codeph>CREATE EXTERNAL TABLE</codeph> is the other alternative; with external tables,
+      you can query existing data files, while the files remain in their original location.)
+    </p>
+
+    <p>
+      To simulate the effects of an <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> statement in other database
+      systems, typically you use <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> to copy data
+      from one table to another, filtering out or changing the appropriate rows during the copy operation, as shown
+      in the sketch below.
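+    </p>
+
+    <p>
+      A minimal sketch of both techniques; the table and column names are hypothetical:
+    </p>
+
+<codeblock>-- Simulate DELETE: copy only the rows that should survive.
+create table active_customers as
+  select * from customers where status != 'CANCELLED';
+
+-- Simulate UPDATE: change values through expressions during the copy.
+-- (Assumes customers_adjusted was created earlier with a matching schema.)
+insert overwrite table customers_adjusted
+  select id, name, balance * 1.05 from customers;</codeblock>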
+    <p>
+      Although Impala currently does not have an <codeph>UPDATE</codeph> statement, you can achieve a similar
+      result by using Impala tables stored in HBase. When you insert a row into an HBase table and the table
+      already contains a row with the same value for the key column, the older row is hidden, effectively the same
+      as a single-row <codeph>UPDATE</codeph>.
+    </p>
+
+    <p rev="2.2.0">
+      Currently, Impala cannot perform DML operations for tables or partitions stored in the Amazon S3 filesystem.
+      See <xref href="impala_s3.xml#s3"/> for details.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      The other major classifications of SQL statements are data definition language (see
+      <xref href="impala_ddl.xml#ddl"/>) and queries (see <xref href="impala_select.xml#select"/>).
+    </p>
+  </conbody>
+</concept>