Repository: incubator-impala Updated Branches: refs/heads/master cb52b2b8a -> 2a314e780
IMPALA-1654: [DOCS] DDL for multiple partitions Syntax and usage notes for ALTER TABLE, COMPUTE STATS, and SHOW FILES. Mixed in a little bit with new Kudu syntax for ALTER TABLE. Didn't include all new Kudu info in this CR, the better to minimize merge conflicts. Added note about performance/scalability of IMPALA-1654. Added new Known Issue item for IMPALA-4106 under Performance category. Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b Reviewed-on: http://gerrit.cloudera.org:8080/5726 Reviewed-by: Alex Behm <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/eaefbb90 Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/eaefbb90 Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/eaefbb90 Branch: refs/heads/master Commit: eaefbb90cefc65fc7a0520b8a18f8bf4d03d99ad Parents: cb52b2b Author: John Russell <[email protected]> Authored: Tue Jan 17 13:19:13 2017 -0800 Committer: Impala Public Jenkins <[email protected]> Committed: Wed Jan 25 18:39:12 2017 +0000 ---------------------------------------------------------------------- docs/topics/impala_alter_table.xml | 80 ++++++++++++++++++++++++++++++- docs/topics/impala_compute_stats.xml | 76 ++++++++++++++++++++++++++++- docs/topics/impala_known_issues.xml | 18 +++++++ docs/topics/impala_show.xml | 15 +++++- 4 files changed, 186 insertions(+), 3 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_alter_table.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_alter_table.xml b/docs/topics/impala_alter_table.xml index 5065319..950ebc6 100644 --- a/docs/topics/impala_alter_table.xml +++ b/docs/topics/impala_alter_table.xml @@ -80,7 +80,11 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph> <varname>col_spec</varname> ::= <varname>col_name</varname> <varname>type_name</varname> -<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname> +<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph> + +<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname> + +<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph> <varname>table_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...] @@ -124,6 +128,80 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph> an external table, the underlying data directory is not renamed or moved. </p> + <p> + <b>Dropping or altering multiple partitions:</b> + </p> + + <p rev="IMPALA-1654"> + In <keyword keyref="impala28_full"/> and higher, + the expression for the partition clause with a <codeph>DROP</codeph> or <codeph>SET</codeph> + operation can include comparison operators such as <codeph><</codeph>, <codeph>IN</codeph>, + or <codeph>BETWEEN</codeph>, and Boolean operators such as <codeph>AND</codeph> + and <codeph>OR</codeph>. + </p> + + <p rev="IMPALA-1654"> + For example, you might drop a group of partitions corresponding to a particular date + range after the data <q>ages out</q>: + </p> + +<codeblock><![CDATA[ +alter table historical_data drop partition (year < 1995); +alter table historical_data drop partition (year = 1996 and month between 1 and 6); +]]> +</codeblock> + + <p rev="IMPALA-1654"> + For tables with multiple partition keys columns, you can specify multiple + conditions separated by commas, and the operation only applies to the partitions + that match all the conditions (similar to using an <codeph>AND</codeph> clause): + </p> + +<codeblock><![CDATA[ +alter table historical_data drop partition (year < 1995, last_name like 'A%'); +]]> +</codeblock> + + <p rev="IMPALA-1654"> + This technique can also be used to change the file format of groups of partitions, + as part of an ETL pipeline that periodically consolidates and rewrites the underlying + data files in a different file format: + </p> + +<codeblock><![CDATA[ +alter table fast_growing_data partition (year = 2016, month in (10,11,12)) set fileformat parquet; +]]> +</codeblock> + + <note> + <p rev="IMPALA-1654"> + The extended syntax involving comparison operators and multiple partitions + applies to the <codeph>SET FILEFORMAT</codeph>, <codeph>SET TBLPROPERTIES</codeph>, + <codeph>SET SERDEPROPERTIES</codeph>, and <codeph>SET [UN]CACHED</codeph> clauses. + You can also use this syntax with the <codeph>PARTITION</codeph> clause + in the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, and with the + <codeph>PARTITION</codeph> clause of the <codeph>SHOW FILES</codeph> statement. + Some forms of <codeph>ALTER TABLE</codeph> still only apply to one partition + at a time: the <codeph>SET LOCATION</codeph> and <codeph>ADD PARTITION</codeph> + clauses. The <codeph>PARTITION</codeph> clauses in the <codeph>LOAD DATA</codeph> + and <codeph>INSERT</codeph> statements also only apply to one partition at a time. + </p> + <p> + A DDL statement that applies to multiple partitions is considered successful + (resulting in no changes) even if no partitions match the conditions. + The results are the same as if the <codeph>IF EXISTS</codeph> clause was specified. + </p> + <p> + The performance and scalability of this technique is similar to + issuing a sequence of single-partition <codeph>ALTER TABLE</codeph> + statements in quick succession. To minimize bottlenecks due to + communication with the metastore database, or causing other + DDL operations on the same table to wait, test the effects of + performing <codeph>ALTER TABLE</codeph> statements that affect + large numbers of partitions. + </p> + </note> + <p conref="../shared/impala_common.xml#common/s3_blurb"/> <p rev="2.6.0 CDH-39913 IMPALA-1878"> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_compute_stats.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_compute_stats.xml b/docs/topics/impala_compute_stats.xml index 8142da4..5a15c72 100644 --- a/docs/topics/impala_compute_stats.xml +++ b/docs/topics/impala_compute_stats.xml @@ -52,7 +52,12 @@ under the License. <codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname> COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)] -<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname> +<!-- Is kudu_partition_spec applicable here? --> +<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph> + +<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname> + +<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph> </codeblock> <p conref="../shared/impala_common.xml#common/incremental_partition_spec"/> @@ -108,6 +113,75 @@ COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varn </ul> </note> + <p rev="IMPALA-1654"> + <b>Computing stats for groups of partitions:</b> + </p> + + <p rev="IMPALA-1654"> + In <keyword keyref="impala28_full"/> and higher, you can run <codeph>COMPUTE INCREMENTAL STATS</codeph> + on multiple partitions, instead of the entire table or one partition at a time. You include + comparison operators other than <codeph>=</codeph> in the <codeph>PARTITION</codeph> clause, + and the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement applies to all partitions that + match the comparison expression. + </p> + + <p rev="IMPALA-1654"> + For example, the <codeph>INT_PARTITIONS</codeph> table contains 4 partitions. + The following <codeph>COMPUTE INCREMENTAL STATS</codeph> statements affect some but not all + partitions, as indicated by the <codeph>Updated <varname>n</varname> partition(s)</codeph> + messages. The partitions that are affected depend on values in the partition key column <codeph>X</codeph> + that match the comparison expression in the <codeph>PARTITION</codeph> clause. + </p> + +<codeblock rev="IMPALA-1654"><![CDATA[ +show partitions int_partitions; ++-------+-------+--------+------+--------------+-------------------+---------+... +| x | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... ++-------+-------+--------+------+--------------+-------------------+---------+... +| 99 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET |... +| 120 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |... +| 150 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |... +| 200 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |... +| Total | -1 | 0 | 0B | 0B | | |... ++-------+-------+--------+------+--------------+-------------------+---------+... + +compute incremental stats int_partitions partition (x < 100); ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 1 partition(s) and 1 column(s). | ++-----------------------------------------+ + +compute incremental stats int_partitions partition (x in (100, 150, 200)); ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 2 partition(s) and 1 column(s). | ++-----------------------------------------+ + +compute incremental stats int_partitions partition (x between 100 and 175); ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 2 partition(s) and 1 column(s). | ++-----------------------------------------+ + +compute incremental stats int_partitions partition (x in (100, 150, 200) or x < 100); ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 3 partition(s) and 1 column(s). | ++-----------------------------------------+ + +compute incremental stats int_partitions partition (x != 150); ++-----------------------------------------+ +| summary | ++-----------------------------------------+ +| Updated 3 partition(s) and 1 column(s). | ++-----------------------------------------+ +]]> +</codeblock> + <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> <p rev="2.3.0"> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_known_issues.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_known_issues.xml b/docs/topics/impala_known_issues.xml index 72c9b12..a910ef0 100644 --- a/docs/topics/impala_known_issues.xml +++ b/docs/topics/impala_known_issues.xml @@ -248,6 +248,24 @@ https://issues.cloudera.org/browse/IMPALA-2144 - Don't have </conbody> + <concept id="IMPALA-4106" rev="IMPALA-4106"> + <title>Use Hive Metastore bulk API for dropping multiple partitions.</title> + <conbody> + <p> + The bulk partition dropping and setting feature of IMPALA-1654 is not as efficient + as it could be, because it currently does not use the Hive Metastore bulk API. + </p> + <p><b>Bug:</b> <xref keyref="IMPALA-4106">IMPALA-4106</xref></p> + <p><b>Severity:</b> High</p> + <p><b>Workaround:</b> Schedule <codeph>ALTER TABLE</codeph> operations that touch + many partitions for times when the table is not undergoing any other DDL operations, + and be prepared for the table to be locked for some time while the <codeph>ALTER TABLE</codeph> + is in progress. Test the performance of large-scale partition operations in a development + environment before trying on tables in a production system. + </p> + </conbody> + </concept> + <concept id="IMPALA-1480" rev="IMPALA-1480"> <!-- Not part of Alex's spreadsheet. Spreadsheet has IMPALA-1423 which mentions it's similar to this one but not a duplicate. --> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_show.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_show.xml b/docs/topics/impala_show.xml index c5a7740..84e9c0b 100644 --- a/docs/topics/impala_show.xml +++ b/docs/topics/impala_show.xml @@ -49,7 +49,7 @@ SHOW TABLES [IN <varname>database_name</varname>] [[LIKE] '<varname>pattern</var <ph rev="1.2.1">SHOW TABLE STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph> <ph rev="1.2.1">SHOW COLUMN STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph> <ph rev="1.4.0">SHOW PARTITIONS [<varname>database_name</varname>.]<varname>table_name</varname></ph> -<ph rev="2.2.0">SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>key_col</varname>=<varname>value</varname> [, <varname>key_col</varname>=<varname>value</varname>]]</ph> +SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> <ph rev="IMPALA-1654">[PARTITION (<varname>key_col_expression</varname> [, <varname>key_col_expression</varname>]</ph>] <ph rev="2.0.0">SHOW ROLES SHOW CURRENT ROLES @@ -113,6 +113,19 @@ show tables '*dim*|*fact*';</codeblock> <codeph>MB</codeph> for megabytes, and <codeph>GB</codeph> for gigabytes. </p> + <p rev="IMPALA-1654"> + In <keyword keyref="impala28_full"/> and higher, you can use general + expressions with operators such as <codeph><</codeph>, <codeph>IN</codeph>, + <codeph>LIKE</codeph>, and <codeph>BETWEEN</codeph> in the <codeph>PARTITION</codeph> + clause, instead of only equality operators. For example: +<codeblock><![CDATA[ +show files in sample_table partition (j < 5); +show files in sample_table partition (k = 3, l between 1 and 10); +show files in sample_table partition (month like 'J%'); +]]> +</codeblock> + </p> + <note> This statement applies to tables and partitions stored on HDFS, or in the Amazon Simple Storage System (S3). It does not apply to views.
