http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_max.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_max.xml b/docs/topics/impala_max.xml new file mode 100644 index 0000000..b989785 --- /dev/null +++ b/docs/topics/impala_max.xml @@ -0,0 +1,192 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="max"> + + <title>MAX Function</title> + <titlealts><navtitle>MAX</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Analytic Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">max() function</indexterm> + An aggregate function that returns the maximum value from a set of numbers. Opposite of the + <codeph>MIN</codeph> function. Its single argument can be numeric column, or the numeric result of a function + or expression applied to the column value. Rows with a <codeph>NULL</codeph> value for the specified column + are ignored. If the table is empty, or all the values supplied to <codeph>MAX</codeph> are + <codeph>NULL</codeph>, <codeph>MAX</codeph> returns <codeph>NULL</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>MAX([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]</codeblock> + + <p> + When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of + grouping values. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_sliding_window"/> + + <p conref="../shared/impala_common.xml#common/return_type_same_except_string"/> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>-- Find the largest value for this column in the table. +select max(c1) from t1; +-- Find the largest value for this column from a subset of the table. +select max(c1) from t1 where month = 'January' and year = '2013'; +-- Find the largest value from a set of numeric function results. +select max(length(s)) from t1; +-- Can also be used in combination with DISTINCT and/or GROUP BY. +-- Return more than one result. +select month, year, max(purchase_price) from store_stats group by month, year; +-- Filter the input to eliminate duplicates before performing the calculation. +select max(distinct x) from t1; +</codeblock> + + <p rev="2.0.0"> + The following examples show how to use <codeph>MAX()</codeph> in an analytic context. They use a table + containing integers from 1 to 10. Notice how the <codeph>MAX()</codeph> is reported for each input value, as + opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set. +<codeblock>select x, property, max(x) over (partition by property) as max from int_t where property in ('odd','even'); ++----+----------+-----+ +| x | property | max | ++----+----------+-----+ +| 2 | even | 10 | +| 4 | even | 10 | +| 6 | even | 10 | +| 8 | even | 10 | +| 10 | even | 10 | +| 1 | odd | 9 | +| 3 | odd | 9 | +| 5 | odd | 9 | +| 7 | odd | 9 | +| 9 | odd | 9 | ++----+----------+-----+ +</codeblock> + +Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving +set of rows (the <q>window</q>). The following examples use <codeph>MAX()</codeph> in an analytic context +(that is, with an <codeph>OVER()</codeph> clause) to display the smallest value of <codeph>X</codeph> +encountered up to each row in the result set. The examples use two columns in the <codeph>ORDER BY</codeph> +clause to produce a sequence of values that rises and falls, to illustrate how the <codeph>MAX()</codeph> +result only increases or stays the same throughout each partition within the result set. +The basic <codeph>ORDER BY x</codeph> clause implicitly +activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, +which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, +therefore all of these examples produce the same results: + +<codeblock>select x, property, + max(x) <b>over (order by property, x desc)</b> as 'maximum to this point' +from int_t where property in ('prime','square'); ++---+----------+-----------------------+ +| x | property | maximum to this point | ++---+----------+-----------------------+ +| 7 | prime | 7 | +| 5 | prime | 7 | +| 3 | prime | 7 | +| 2 | prime | 7 | +| 9 | square | 9 | +| 4 | square | 9 | +| 1 | square | 9 | ++---+----------+-----------------------+ + +select x, property, + max(x) over + ( + <b>order by property, x desc</b> + <b>rows between unbounded preceding and current row</b> + ) as 'maximum to this point' +from int_t where property in ('prime','square'); ++---+----------+-----------------------+ +| x | property | maximum to this point | ++---+----------+-----------------------+ +| 7 | prime | 7 | +| 5 | prime | 7 | +| 3 | prime | 7 | +| 2 | prime | 7 | +| 9 | square | 9 | +| 4 | square | 9 | +| 1 | square | 9 | ++---+----------+-----------------------+ + +select x, property, + max(x) over + ( + <b>order by property, x desc</b> + <b>range between unbounded preceding and current row</b> + ) as 'maximum to this point' +from int_t where property in ('prime','square'); ++---+----------+-----------------------+ +| x | property | maximum to this point | ++---+----------+-----------------------+ +| 7 | prime | 7 | +| 5 | prime | 7 | +| 3 | prime | 7 | +| 2 | prime | 7 | +| 9 | square | 9 | +| 4 | square | 9 | +| 1 | square | 9 | ++---+----------+-----------------------+ +</codeblock> + +The following examples show how to construct a moving window, with a running maximum taking into account all rows before +and 1 row after the current row. +Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of +moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph> clause. +Because of an extra Impala restriction on the <codeph>MAX()</codeph> and <codeph>MIN()</codeph> functions in an +analytic context, the lower bound must be <codeph>UNBOUNDED PRECEDING</codeph>. +<codeblock>select x, property, + max(x) over + ( + <b>order by property, x</b> + <b>rows between unbounded preceding and 1 following</b> + ) as 'local maximum' +from int_t where property in ('prime','square'); ++---+----------+---------------+ +| x | property | local maximum | ++---+----------+---------------+ +| 2 | prime | 3 | +| 3 | prime | 5 | +| 5 | prime | 7 | +| 7 | prime | 7 | +| 1 | square | 7 | +| 4 | square | 9 | +| 9 | square | 9 | ++---+----------+---------------+ + +-- Doesn't work because of syntax restriction on RANGE clause. +select x, property, + max(x) over + ( + <b>order by property, x</b> + <b>range between unbounded preceding and 1 following</b> + ) as 'local maximum' +from int_t where property in ('prime','square'); +ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW. +</codeblock> + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#analytic_functions"/>, <xref href="impala_min.xml#min"/>, + <xref href="impala_avg.xml#avg"/> + </p> + </conbody> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_max_errors.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_max_errors.xml b/docs/topics/impala_max_errors.xml new file mode 100644 index 0000000..86f3618 --- /dev/null +++ b/docs/topics/impala_max_errors.xml @@ -0,0 +1,44 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="max_errors"> + + <title>MAX_ERRORS Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Logs"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">MAX_ERRORS query option</indexterm> + Maximum number of non-fatal errors for any particular query that are recorded in the Impala log file. For + example, if a billion-row table had a non-fatal data error in every row, you could diagnose the problem + without all billion errors being logged. Unspecified or 0 indicates the built-in default value of 1000. + </p> + + <p> + This option only controls how many errors are reported. To specify whether Impala continues or halts when it + encounters such errors, use the <codeph>ABORT_ON_ERROR</codeph> option. + </p> + + <p> + <b>Type:</b> numeric + </p> + + <p> + <b>Default:</b> 0 (meaning 1000 errors) + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_abort_on_error.xml#abort_on_error"/>, + <xref href="impala_logging.xml#logging"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_max_io_buffers.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_max_io_buffers.xml b/docs/topics/impala_max_io_buffers.xml new file mode 100644 index 0000000..b08c57e --- /dev/null +++ b/docs/topics/impala_max_io_buffers.xml @@ -0,0 +1,28 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="max_io_buffers"> + + <title>MAX_IO_BUFFERS Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Deprecated Features"/> + </metadata> + </prolog> + + <conbody> + + <p> + Deprecated query option. Currently has no effect. + </p> + + <p> + <b>Type:</b> numeric + </p> + + <p> + <b>Default:</b> 0 + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_max_scan_range_length.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_max_scan_range_length.xml b/docs/topics/impala_max_scan_range_length.xml new file mode 100644 index 0000000..a790fc7 --- /dev/null +++ b/docs/topics/impala_max_scan_range_length.xml @@ -0,0 +1,45 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="max_scan_range_length"> + + <title>MAX_SCAN_RANGE_LENGTH Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">MAX_SCAN_RANGE_LENGTH query option</indexterm> + Maximum length of the scan range. Interacts with the number of HDFS blocks in the table to determine how many + CPU cores across the cluster are involved with the processing for a query. (Each core processes one scan + range.) + </p> + + <p> + Lowering the value can sometimes increase parallelism if you have unused CPU capacity, but a too-small value + can limit query performance because each scan range involves extra overhead. + </p> + + <p> + Only applicable to HDFS tables. Has no effect on Parquet tables. Unspecified or 0 indicates backend default, + which is the same as the HDFS block size for each table. + </p> + + <p> + Although the scan range can be arbitrarily long, Impala internally uses an 8 MB read buffer so that it can + query tables with huge block sizes without allocating equivalent blocks of memory. + </p> + + <p> + <b>Type:</b> numeric + </p> + + <p> + <b>Default:</b> 0 + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_mem_limit.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_mem_limit.xml b/docs/topics/impala_mem_limit.xml new file mode 100644 index 0000000..fd12953 --- /dev/null +++ b/docs/topics/impala_mem_limit.xml @@ -0,0 +1,208 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="mem_limit"> + + <title>MEM_LIMIT Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Scalability"/> + <data name="Category" value="Memory"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">MEM_LIMIT query option</indexterm> + When resource management is not enabled, defines the maximum amount of memory a query can allocate on each node. + Therefore, the total memory that can be used by a query is the <codeph>MEM_LIMIT</codeph> times the number of nodes. + </p> + + <p rev="CDH-32135"> + There are two levels of memory limit for Impala. + The <codeph>-mem_limit</codeph> startup option sets an overall limit for the <cmdname>impalad</cmdname> process + (which handles multiple queries concurrently). + That limit is typically expressed in terms of a percentage of the RAM available on the host, such as <codeph>-mem_limit=70%</codeph>. + The <codeph>MEM_LIMIT</codeph> query option, which you set through <cmdname>impala-shell</cmdname> + or the <codeph>SET</codeph> statement in a JDBC or ODBC application, applies to each individual query. + The <codeph>MEM_LIMIT</codeph> query option is usually expressed as a fixed size such as <codeph>10gb</codeph>, + and must always be less than the <cmdname>impalad</cmdname> memory limit. + </p> + + <p rev="CDH-32135"> + If query processing exceeds the specified memory limit on any node, either the per-query limit or the + <cmdname>impalad</cmdname> limit, Impala cancels the query automatically. + Memory limits are checked periodically during query processing, so the actual memory in use + might briefly exceed the limit without the query being cancelled. + </p> + + <p> + When resource management is enabled in CDH 5, the mechanism for this option changes. If set, it overrides the + automatic memory estimate from Impala. Impala requests this amount of memory from YARN on each node, and the + query does not proceed until that much memory is available. The actual memory used by the query could be + lower, since some queries use much less memory than others. With resource management, the + <codeph>MEM_LIMIT</codeph> setting acts both as a hard limit on the amount of memory a query can use on any + node (enforced by YARN) and a guarantee that that much memory will be available on each node while the query + is being executed. When resource management is enabled but no <codeph>MEM_LIMIT</codeph> setting is + specified, Impala estimates the amount of memory needed on each node for each query, requests that much + memory from YARN before starting the query, and then internally sets the <codeph>MEM_LIMIT</codeph> on each + node to the requested amount of memory during the query. Thus, if the query takes more memory than was + originally estimated, Impala detects that the <codeph>MEM_LIMIT</codeph> is exceeded and cancels the query + itself. + </p> + + <p> + <b>Type:</b> numeric + </p> + + <p rev="CDH-32135"> + <b>Units:</b> A numeric argument represents memory size in bytes; you can also use a suffix of <codeph>m</codeph> or <codeph>mb</codeph> + for megabytes, or more commonly <codeph>g</codeph> or <codeph>gb</codeph> for gigabytes. If you specify a value with unrecognized + formats, subsequent queries fail with an error. + </p> + + <p rev="CDH-32135"> + <b>Default:</b> 0 (unlimited) + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p rev="CDH-32135"> + The <codeph>MEM_LIMIT</codeph> setting is primarily useful in a high-concurrency setting, + or on a cluster with a workload shared between Impala and other data processing components. + You can prevent any query from accidentally using much more memory than expected, + which could negatively impact other Impala queries. + </p> + + <p rev="CDH-32135"> + Use the output of the <codeph>SUMMARY</codeph> command in <cmdname>impala-shell</cmdname> + to get a report of memory used for each phase of your most heavyweight queries on each node, + and then set a <codeph>MEM_LIMIT</codeph> somewhat higher than that. + See <xref href="impala_explain_plan.xml#perf_summary"/> for usage information about + the <codeph>SUMMARY</codeph> command. + </p> + + <p conref="../shared/impala_common.xml#common/example_blurb" rev="CDH-32135"/> + + <p rev="CDH-32135"> + The following examples show how to set the <codeph>MEM_LIMIT</codeph> query option + using a fixed number of bytes, or suffixes representing gigabytes or megabytes. + </p> + +<codeblock rev="CDH-32135"> +[localhost:21000] > set mem_limit=3000000000; +MEM_LIMIT set to 3000000000 +[localhost:21000] > select 5; +Query: select 5 ++---+ +| 5 | ++---+ +| 5 | ++---+ + +[localhost:21000] > set mem_limit=3g; +MEM_LIMIT set to 3g +[localhost:21000] > select 5; +Query: select 5 ++---+ +| 5 | ++---+ +| 5 | ++---+ + +[localhost:21000] > set mem_limit=3gb; +MEM_LIMIT set to 3gb +[localhost:21000] > select 5; ++---+ +| 5 | ++---+ +| 5 | ++---+ + +[localhost:21000] > set mem_limit=3m; +MEM_LIMIT set to 3m +[localhost:21000] > select 5; ++---+ +| 5 | ++---+ +| 5 | ++---+ +[localhost:21000] > set mem_limit=3mb; +MEM_LIMIT set to 3mb +[nightly55-2.vpc.cloudera.com:21000] > select 5; ++---+ +| 5 | ++---+ +</codeblock> + + <p rev="CDH-32135"> + The following examples show how unrecognized <codeph>MEM_LIMIT</codeph> + values lead to errors for subsequent queries. + </p> + +<codeblock rev="CDH-32135"> +[localhost:21000] > set mem_limit=3tb; +MEM_LIMIT set to 3tb +[localhost:21000] > select 5; +ERROR: Failed to parse query memory limit from '3tb'. + +[localhost:21000] > set mem_limit=xyz; +MEM_LIMIT set to xyz +[localhost:21000] > select 5; +Query: select 5 +ERROR: Failed to parse query memory limit from 'xyz'. +</codeblock> + + <p rev="CDH-32135"> + The following examples shows the automatic query cancellation + when the <codeph>MEM_LIMIT</codeph> value is exceeded + on any host involved in the Impala query. First it runs a + successful query and checks the largest amount of memory + used on any node for any stage of the query. + Then it sets an artificially low <codeph>MEM_LIMIT</codeph> + setting so that the same query cannot run. + </p> + +<codeblock rev="CDH-32135"> +[localhost:21000] > select count(*) from customer; +Query: select count(*) from customer ++----------+ +| count(*) | ++----------+ +| 150000 | ++----------+ + +[localhost:21000] > select count(distinct c_name) from customer; +Query: select count(distinct c_name) from customer ++------------------------+ +| count(distinct c_name) | ++------------------------+ +| 150000 | ++------------------------+ + +[localhost:21000] > summary; ++--------------+--------+----------+----------+---------+------------+----------+---------------+---------------+ +| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | ++--------------+--------+----------+----------+---------+------------+----------+---------------+---------------+ +| 06:AGGREGATE | 1 | 230.00ms | 230.00ms | 1 | 1 | 16.00 KB | -1 B | FINALIZE | +| 05:EXCHANGE | 1 | 43.44us | 43.44us | 1 | 1 | 0 B | -1 B | UNPARTITIONED | +| 02:AGGREGATE | 1 | 227.14ms | 227.14ms | 1 | 1 | 12.00 KB | 10.00 MB | | +| 04:AGGREGATE | 1 | 126.27ms | 126.27ms | 150.00K | 150.00K | 15.17 MB | 10.00 MB | | +| 03:EXCHANGE | 1 | 44.07ms | 44.07ms | 150.00K | 150.00K | 0 B | 0 B | HASH(c_name) | +<b>| 01:AGGREGATE | 1 | 361.94ms | 361.94ms | 150.00K | 150.00K | 23.04 MB | 10.00 MB | |</b> +| 00:SCAN HDFS | 1 | 43.64ms | 43.64ms | 150.00K | 150.00K | 24.19 MB | 64.00 MB | tpch.customer | ++--------------+--------+----------+----------+---------+------------+----------+---------------+---------------+ + +[localhost:21000] > set mem_limit=15mb; +MEM_LIMIT set to 15mb +[localhost:21000] > select count(distinct c_name) from customer; +Query: select count(distinct c_name) from customer +ERROR: +Memory limit exceeded +Query did not have enough memory to get the minimum required buffers in the block manager. +</codeblock> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_min.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_min.xml b/docs/topics/impala_min.xml new file mode 100644 index 0000000..a63fc4c --- /dev/null +++ b/docs/topics/impala_min.xml @@ -0,0 +1,191 @@ +<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="min"> + + <title>MIN Function</title> + <titlealts><navtitle>MIN</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Analytic Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">min() function</indexterm> + An aggregate function that returns the minimum value from a set of numbers. Opposite of the + <codeph>MAX</codeph> function. Its single argument can be numeric column, or the numeric result of a function + or expression applied to the column value. Rows with a <codeph>NULL</codeph> value for the specified column + are ignored. If the table is empty, or all the values supplied to <codeph>MIN</codeph> are + <codeph>NULL</codeph>, <codeph>MIN</codeph> returns <codeph>NULL</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>MIN([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]</codeblock> + + <p> + When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of + grouping values. + </p> + + <p conref="../shared/impala_common.xml#common/restrictions_sliding_window"/> + + <p conref="../shared/impala_common.xml#common/return_type_same_except_string"/> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + +<codeblock>-- Find the smallest value for this column in the table. +select min(c1) from t1; +-- Find the smallest value for this column from a subset of the table. +select min(c1) from t1 where month = 'January' and year = '2013'; +-- Find the smallest value from a set of numeric function results. +select min(length(s)) from t1; +-- Can also be used in combination with DISTINCT and/or GROUP BY. +-- Return more than one result. +select month, year, min(purchase_price) from store_stats group by month, year; +-- Filter the input to eliminate duplicates before performing the calculation. +select min(distinct x) from t1; +</codeblock> + + <p rev="2.0.0"> + The following examples show how to use <codeph>MIN()</codeph> in an analytic context. They use a table + containing integers from 1 to 10. Notice how the <codeph>MIN()</codeph> is reported for each input value, as + opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set. +<codeblock>select x, property, min(x) over (partition by property) as min from int_t where property in ('odd','even'); ++----+----------+-----+ +| x | property | min | ++----+----------+-----+ +| 2 | even | 2 | +| 4 | even | 2 | +| 6 | even | 2 | +| 8 | even | 2 | +| 10 | even | 2 | +| 1 | odd | 1 | +| 3 | odd | 1 | +| 5 | odd | 1 | +| 7 | odd | 1 | +| 9 | odd | 1 | ++----+----------+-----+ +</codeblock> + +Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving +set of rows (the <q>window</q>). The following examples use <codeph>MIN()</codeph> in an analytic context +(that is, with an <codeph>OVER()</codeph> clause) to display the smallest value of <codeph>X</codeph> +encountered up to each row in the result set. The examples use two columns in the <codeph>ORDER BY</codeph> +clause to produce a sequence of values that rises and falls, to illustrate how the <codeph>MIN()</codeph> +result only decreases or stays the same throughout each partition within the result set. +The basic <codeph>ORDER BY x</codeph> clause implicitly +activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, +which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, +therefore all of these examples produce the same results: + +<codeblock>select x, property, min(x) <b>over (order by property, x desc)</b> as 'minimum to this point' + from int_t where property in ('prime','square'); ++---+----------+-----------------------+ +| x | property | minimum to this point | ++---+----------+-----------------------+ +| 7 | prime | 7 | +| 5 | prime | 5 | +| 3 | prime | 3 | +| 2 | prime | 2 | +| 9 | square | 2 | +| 4 | square | 2 | +| 1 | square | 1 | ++---+----------+-----------------------+ + +select x, property, + min(x) over + ( + <b>order by property, x desc</b> + <b>range between unbounded preceding and current row</b> + ) as 'minimum to this point' +from int_t where property in ('prime','square'); ++---+----------+-----------------------+ +| x | property | minimum to this point | ++---+----------+-----------------------+ +| 7 | prime | 7 | +| 5 | prime | 5 | +| 3 | prime | 3 | +| 2 | prime | 2 | +| 9 | square | 2 | +| 4 | square | 2 | +| 1 | square | 1 | ++---+----------+-----------------------+ + +select x, property, + min(x) over + ( + <b>order by property, x desc</b> + <b>rows between unbounded preceding and current row</b> + ) as 'minimum to this point' +from int_t where property in ('prime','square'); ++---+----------+-----------------------+ +| x | property | minimum to this point | ++---+----------+-----------------------+ +| 7 | prime | 7 | +| 5 | prime | 5 | +| 3 | prime | 3 | +| 2 | prime | 2 | +| 9 | square | 2 | +| 4 | square | 2 | +| 1 | square | 1 | ++---+----------+-----------------------+ +</codeblock> + +The following examples show how to construct a moving window, with a running minimum taking into account all rows before +and 1 row after the current row. +Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of +moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph> clause. +Because of an extra Impala restriction on the <codeph>MAX()</codeph> and <codeph>MIN()</codeph> functions in an +analytic context, the lower bound must be <codeph>UNBOUNDED PRECEDING</codeph>. +<codeblock>select x, property, + min(x) over + ( + <b>order by property, x desc</b> + <b>rows between unbounded preceding and 1 following</b> + ) as 'local minimum' +from int_t where property in ('prime','square'); ++---+----------+---------------+ +| x | property | local minimum | ++---+----------+---------------+ +| 7 | prime | 5 | +| 5 | prime | 3 | +| 3 | prime | 2 | +| 2 | prime | 2 | +| 9 | square | 2 | +| 4 | square | 1 | +| 1 | square | 1 | ++---+----------+---------------+ + +-- Doesn't work because of syntax restriction on RANGE clause. +select x, property, + min(x) over + ( + <b>order by property, x desc</b> + <b>range between unbounded preceding and 1 following</b> + ) as 'local minimum' +from int_t where property in ('prime','square'); +ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW. +</codeblock> + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_analytic_functions.xml#analytic_functions"/>, <xref href="impala_max.xml#max"/>, + <xref href="impala_avg.xml#avg"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_misc_functions.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_misc_functions.xml b/docs/topics/impala_misc_functions.xml new file mode 100644 index 0000000..bb9f062 --- /dev/null +++ b/docs/topics/impala_misc_functions.xml @@ -0,0 +1,148 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="misc_functions"> + + <title>Impala Miscellaneous Functions</title> + <titlealts><navtitle>Miscellaneous Functions</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + Impala supports the following utility functions that do not operate on a particular column or data type: + </p> + + <dl> + <dlentry rev="1.3.0" id="current_database"> + + <dt> + <codeph>current_database()</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">current_database() function</indexterm> + <b>Purpose:</b> Returns the database that the session is currently using, either <codeph>default</codeph> + if no database has been selected, or whatever database the session switched to through a + <codeph>USE</codeph> statement or the <cmdname>impalad</cmdname><codeph>-d</codeph> option. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry rev="5.4.5" id="effective_user"> + + <dt> + <codeph>effective_user()</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">effective_user() function</indexterm> + <b>Purpose:</b> Typically returns the same value as <codeph>user()</codeph>, + except if delegation is enabled, in which case it returns the ID of the delegated user. + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + <p> + <b>Added in:</b> CDH 5.4.5 + </p> + </dd> + + </dlentry> + + <dlentry rev="1.3.0" id="pid"> + + <dt> + <codeph>pid()</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">pid() function</indexterm> + <b>Purpose:</b> Returns the process ID of the <cmdname>impalad</cmdname> daemon that the session is + connected to. You can use it during low-level debugging, to issue Linux commands that trace, show the + arguments, and so on the <cmdname>impalad</cmdname> process. + <p> + <b>Return type:</b> <codeph>int</codeph> + </p> + </dd> + + </dlentry> + + <dlentry audience="Cloudera" id="sleep"> + + <dt> + <codeph>sleep(int ms)</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">sleep() function</indexterm> + <b>Purpose:</b> Pauses the query for a specified number of milliseconds. For slowing down queries with + small result sets enough to monitor runtime execution, memory usage, or other factors that otherwise + would be difficult to capture during the brief interval of query execution. When used in the + <codeph>SELECT</codeph> list, it is called once for each row in the result set; adjust the number of + milliseconds accordingly. For example, a query <codeph>SELECT *, sleep(5) FROM + table_with_1000_rows</codeph> would take at least 5 seconds to complete (5 milliseconds * 1000 rows in + result set). To avoid an excessive number of concurrent queries, use this function for troubleshooting on + test and development systems, not for production queries. + <p> + <b>Return type:</b> N/A + </p> + </dd> + + </dlentry> + + <dlentry rev="1.1" id="user"> + + <dt> + <codeph>user()</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">user() function</indexterm> + <b>Purpose:</b> Returns the username of the Linux user who is connected to the <cmdname>impalad</cmdname> + daemon. Typically called a single time, in a query without any <codeph>FROM</codeph> clause, to + understand how authorization settings apply in a security context; once you know the logged-in user name, + you can check which groups that user belongs to, and from the list of groups you can check which roles + are available to those groups through the authorization policy file. + <p conref="../shared/impala_common.xml#common/user_kerberized"/> + <p> + When delegation is enabled, consider calling the <codeph>effective_user()</codeph> function instead. + </p> + <p> + <b>Return type:</b> <codeph>string</codeph> + </p> + </dd> + + </dlentry> + + <dlentry id="version"> + + <dt> + <codeph>version()</codeph> + </dt> + + <dd> + <indexterm audience="Cloudera">version() function</indexterm> + <b>Purpose:</b> Returns information such as the precise version number and build date for the + <codeph>impalad</codeph> daemon that you are currently connected to. Typically used to confirm that you + are connected to the expected level of Impala to use a particular feature, or to connect to several nodes + and confirm they are all running the same level of <cmdname>impalad</cmdname>. + <p> + <b>Return type:</b> <codeph>string</codeph> (with one or more embedded newlines) + </p> + </dd> + + </dlentry> + </dl> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_ndv.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_ndv.xml b/docs/topics/impala_ndv.xml new file mode 100644 index 0000000..a1e5527 --- /dev/null +++ b/docs/topics/impala_ndv.xml @@ -0,0 +1,133 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2.1" id="ndv"> + + <title>NDV Function</title> + <titlealts><navtitle>NDV</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="Aggregate Functions"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">ndv() function</indexterm> + An aggregate function that returns an approximate value similar to the result of <codeph>COUNT(DISTINCT + <varname>col</varname>)</codeph>, the <q>number of distinct values</q>. It is much faster than the + combination of <codeph>COUNT</codeph> and <codeph>DISTINCT</codeph>, and uses a constant amount of memory and + thus is less memory-intensive for columns with high cardinality. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>NDV([DISTINCT | ALL] <varname>expression</varname>)</codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p rev="1.2.2"> + This is the mechanism used internally by the <codeph>COMPUTE STATS</codeph> statement for computing the + number of distinct values in a column. + </p> + + <p> + Because this number is an estimate, it might not reflect the precise number of different values in the + column, especially if the cardinality is very low or very high. If the estimated number is higher than the + number of rows in the table, Impala adjusts the value internally during query planning. + </p> + + <p conref="../shared/impala_common.xml#common/former_odd_return_type_string"/> + +<!-- <p conref="/Content/impala_common_xi44078.xml#common/restrictions_sliding_window"/> --> + + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/> + + <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/> + + <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> + + <p conref="../shared/impala_common.xml#common/analytic_not_allowed_caveat"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example queries a billion-row table to illustrate the relative performance of + <codeph>COUNT(DISTINCT)</codeph> and <codeph>NDV()</codeph>. It shows how <codeph>COUNT(DISTINCT)</codeph> + gives a precise answer, but is inefficient for large-scale data where an approximate result is sufficient. + The <codeph>NDV()</codeph> function gives an approximate result but is much faster. + </p> + +<codeblock>select count(distinct col1) from sample_data; ++---------------------+ +| count(distinct col1)| ++---------------------+ +| 100000 | ++---------------------+ +Fetched 1 row(s) in 20.13s + +select cast(ndv(col1) as bigint) as col1 from sample_data; ++----------+ +| col1 | ++----------+ +| 139017 | ++----------+ +Fetched 1 row(s) in 8.91s +</codeblock> + + <p> + The following example shows how you can code multiple <codeph>NDV()</codeph> calls in a single query, to + easily learn which columns have substantially more or fewer distinct values. This technique is faster than + running a sequence of queries with <codeph>COUNT(DISTINCT)</codeph> calls. + </p> + +<codeblock>select cast(ndv(col1) as bigint) as col1, cast(ndv(col2) as bigint) as col2, + cast(ndv(col3) as bigint) as col3, cast(ndv(col4) as bigint) as col4 + from sample_data; ++----------+-----------+------------+-----------+ +| col1 | col2 | col3 | col4 | ++----------+-----------+------------+-----------+ +| 139017 | 282 | 46 | 145636240 | ++----------+-----------+------------+-----------+ +Fetched 1 row(s) in 34.97s + +select count(distinct col1) from sample_data; ++---------------------+ +| count(distinct col1)| ++---------------------+ +| 100000 | ++---------------------+ +Fetched 1 row(s) in 20.13s + +select count(distinct col2) from sample_data; ++----------------------+ +| count(distinct col2) | ++----------------------+ +| 278 | ++----------------------+ +Fetched 1 row(s) in 20.09s + +select count(distinct col3) from sample_data; ++-----------------------+ +| count(distinct col3) | ++-----------------------+ +| 46 | ++-----------------------+ +Fetched 1 row(s) in 19.12s + +select count(distinct col4) from sample_data; ++----------------------+ +| count(distinct col4) | ++----------------------+ +| 147135880 | ++----------------------+ +Fetched 1 row(s) in 266.95s +</codeblock> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_num_nodes.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_num_nodes.xml b/docs/topics/impala_num_nodes.xml new file mode 100644 index 0000000..75ae8e8 --- /dev/null +++ b/docs/topics/impala_num_nodes.xml @@ -0,0 +1,45 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="num_nodes"> + + <title>NUM_NODES Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">NUM_NODES query option</indexterm> + Limit the number of nodes that process a query, typically during debugging. + + </p> + + <p> + <b>Type:</b> numeric + </p> + +<p> + <b>Allowed values:</b> Only accepts the values 0 + (meaning all nodes) or 1 (meaning all work is done on the coordinator node). +</p> + + <p> + <b>Default:</b> 0 + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + If you are diagnosing a problem that you suspect is due to a timing issue due to + distributed query processing, you can set <codeph>NUM_NODES=1</codeph> to verify + if the problem still occurs when all the work is done on a single node. + </p> + + <p conref="../shared/impala_common.xml#common/num_nodes_tip"/> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_num_scanner_threads.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_num_scanner_threads.xml b/docs/topics/impala_num_scanner_threads.xml new file mode 100644 index 0000000..27cf883 --- /dev/null +++ b/docs/topics/impala_num_scanner_threads.xml @@ -0,0 +1,32 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="num_scanner_threads"> + + <title>NUM_SCANNER_THREADS Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">NUM_SCANNER_THREADS query option</indexterm> + Maximum number of scanner threads (on each node) used for each query. By default, Impala uses as many cores + as are available (one thread per core). You might lower this value if queries are using excessive resources + on a busy cluster. Impala imposes a maximum value automatically, so a high value has no practical effect. + </p> + + <p> + <b>Type:</b> numeric + </p> + + <p> + <b>Default:</b> 0 + </p> + + <note conref="../shared/impala_common.xml#common/compute_stats_parquet"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_offset.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_offset.xml b/docs/topics/impala_offset.xml new file mode 100644 index 0000000..c9c073d --- /dev/null +++ b/docs/topics/impala_offset.xml @@ -0,0 +1,64 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2.1" id="offset"> + + <title>OFFSET Clause</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Reports"/> + </metadata> + </prolog> + + <conbody> + + <p> + The <codeph>OFFSET</codeph> clause in a <codeph>SELECT</codeph> query causes the result set to start some + number of rows after the logical first item. The result set is numbered starting from zero, so <codeph>OFFSET + 0</codeph> produces the same result as leaving out the <codeph>OFFSET</codeph> clause. Always use this clause + in combination with <codeph>ORDER BY</codeph> (so that it is clear which item should be first, second, and so + on) and <codeph>LIMIT</codeph> (so that the result set covers a bounded range, such as items 0-9, 100-199, + and so on). + </p> + + <p conref="../shared/impala_common.xml#common/limit_and_offset"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows how you could run a <q>paging</q> query originally written for a traditional + database application. Because typical Impala queries process megabytes or gigabytes of data and read large + data files from disk each time, it is inefficient to run a separate query to retrieve each small group of + items. Use this technique only for compatibility while porting older applications, then rewrite the + application code to use a single query with a large result set, and display pages of results from the cached + result set. + </p> + +<codeblock>[localhost:21000] > create table numbers (x int); +[localhost:21000] > insert into numbers select x from very_long_sequence; +Inserted 1000000 rows in 1.34s +[localhost:21000] > select x from numbers order by x limit 5 offset 0; ++----+ +| x | ++----+ +| 1 | +| 2 | +| 3 | +| 4 | +| 5 | ++----+ +[localhost:21000] > select x from numbers order by x limit 5 offset 5; ++----+ +| x | ++----+ +| 6 | +| 7 | +| 8 | +| 9 | +| 10 | ++----+ +</codeblock> + </conbody> +</concept>
