Repository: incubator-impala
Updated Branches:
  refs/heads/master cb52b2b8a -> 2a314e780


IMPALA-1654: [DOCS] DDL for multiple partitions

Syntax and usage notes for ALTER TABLE,
COMPUTE STATS, and SHOW FILES.

Mixed in a little bit with new Kudu syntax for
ALTER TABLE. Didn't include all new Kudu info
in this CR, the better to minimize merge conflicts.

Added note about performance/scalability of IMPALA-1654.

Added new Known Issue item for IMPALA-4106 under Performance category.

Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b
Reviewed-on: http://gerrit.cloudera.org:8080/5726
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/eaefbb90
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/eaefbb90
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/eaefbb90

Branch: refs/heads/master
Commit: eaefbb90cefc65fc7a0520b8a18f8bf4d03d99ad
Parents: cb52b2b
Author: John Russell <[email protected]>
Authored: Tue Jan 17 13:19:13 2017 -0800
Committer: Impala Public Jenkins <[email protected]>
Committed: Wed Jan 25 18:39:12 2017 +0000

----------------------------------------------------------------------
 docs/topics/impala_alter_table.xml   | 80 ++++++++++++++++++++++++++++++-
 docs/topics/impala_compute_stats.xml | 76 ++++++++++++++++++++++++++++-
 docs/topics/impala_known_issues.xml  | 18 +++++++
 docs/topics/impala_show.xml          | 15 +++++-
 4 files changed, 186 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_alter_table.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_alter_table.xml 
b/docs/topics/impala_alter_table.xml
index 5065319..950ebc6 100644
--- a/docs/topics/impala_alter_table.xml
+++ b/docs/topics/impala_alter_table.xml
@@ -80,7 +80,11 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
 
 <varname>col_spec</varname> ::= <varname>col_name</varname> 
<varname>type_name</varname>
 
-<varname>partition_spec</varname> ::= 
<varname>partition_col</varname>=<varname>constant_value</varname>
+<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> 
| <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph 
rev="kudu"><varname>kudu_partition_spec</varname></ph>
+
+<varname>simple_partition_spec</varname> ::= 
<varname>partition_col</varname>=<varname>constant_value</varname>
+
+<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= 
<varname>comparison_expression_on_partition_col</varname></ph>
 
 <varname>table_properties</varname> ::= 
'<varname>name</varname>'='<varname>value</varname>'[, 
'<varname>name</varname>'='<varname>value</varname>' ...]
 
@@ -124,6 +128,80 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
       an external table, the underlying data directory is not renamed or moved.
     </p>
 
+    <p>
+      <b>Dropping or altering multiple partitions:</b>
+    </p>
+
+    <p rev="IMPALA-1654">
+      In <keyword keyref="impala28_full"/> and higher,
+      the expression for the partition clause with a <codeph>DROP</codeph> or 
<codeph>SET</codeph>
+      operation can include comparison operators such as 
<codeph>&lt;</codeph>, <codeph>IN</codeph>,
+      or <codeph>BETWEEN</codeph>, and Boolean operators such as 
<codeph>AND</codeph>
+      and <codeph>OR</codeph>.
+    </p>
+
+    <p rev="IMPALA-1654">
+      For example, you might drop a group of partitions corresponding to a 
particular date
+      range after the data <q>ages out</q>:
+    </p>
+
+<codeblock><![CDATA[
+alter table historical_data drop partition (year < 1995);
+alter table historical_data drop partition (year = 1996 and month between 1 
and 6);
+]]>
+</codeblock>
+
+    <p rev="IMPALA-1654">
+      For tables with multiple partition keys columns, you can specify multiple
+      conditions separated by commas, and the operation only applies to the 
partitions
+      that match all the conditions (similar to using an <codeph>AND</codeph> 
clause):
+    </p>
+
+<codeblock><![CDATA[
+alter table historical_data drop partition (year < 1995, last_name like 'A%');
+]]>
+</codeblock>
+
+    <p rev="IMPALA-1654">
+      This technique can also be used to change the file format of groups of 
partitions,
+      as part of an ETL pipeline that periodically consolidates and rewrites 
the underlying
+      data files in a different file format:
+    </p>
+
+<codeblock><![CDATA[
+alter table fast_growing_data partition (year = 2016, month in (10,11,12)) set 
fileformat parquet;
+]]>
+</codeblock>
+
+    <note>
+      <p rev="IMPALA-1654">
+        The extended syntax involving comparison operators and multiple 
partitions
+        applies to the <codeph>SET FILEFORMAT</codeph>, <codeph>SET 
TBLPROPERTIES</codeph>,
+        <codeph>SET SERDEPROPERTIES</codeph>, and <codeph>SET 
[UN]CACHED</codeph> clauses.
+        You can also use this syntax with the <codeph>PARTITION</codeph> clause
+        in the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, and with 
the
+        <codeph>PARTITION</codeph> clause of the <codeph>SHOW FILES</codeph> 
statement.
+        Some forms of <codeph>ALTER TABLE</codeph> still only apply to one 
partition
+        at a time: the <codeph>SET LOCATION</codeph> and <codeph>ADD 
PARTITION</codeph>
+        clauses. The <codeph>PARTITION</codeph> clauses in the <codeph>LOAD 
DATA</codeph>
+        and <codeph>INSERT</codeph> statements also only apply to one 
partition at a time.
+      </p>
+      <p>
+        A DDL statement that applies to multiple partitions is considered 
successful
+        (resulting in no changes) even if no partitions match the conditions.
+        The results are the same as if the <codeph>IF EXISTS</codeph> clause 
was specified.
+      </p>
+      <p>
+        The performance and scalability of this technique is similar to
+        issuing a sequence of single-partition <codeph>ALTER TABLE</codeph>
+        statements in quick succession. To minimize bottlenecks due to
+        communication with the metastore database, or causing other
+        DDL operations on the same table to wait, test the effects of
+        performing <codeph>ALTER TABLE</codeph> statements that affect
+        large numbers of partitions.
+      </p>
+    </note>
+
     <p conref="../shared/impala_common.xml#common/s3_blurb"/>
 
     <p rev="2.6.0 CDH-39913 IMPALA-1878">

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_compute_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_compute_stats.xml 
b/docs/topics/impala_compute_stats.xml
index 8142da4..5a15c72 100644
--- a/docs/topics/impala_compute_stats.xml
+++ b/docs/topics/impala_compute_stats.xml
@@ -52,7 +52,12 @@ under the License.
 <codeblock rev="2.1.0">COMPUTE STATS 
[<varname>db_name</varname>.]<varname>table_name</varname>
 COMPUTE INCREMENTAL STATS 
[<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION 
(<varname>partition_spec</varname>)]
 
-<varname>partition_spec</varname> ::= 
<varname>partition_col</varname>=<varname>constant_value</varname>
+<!-- Is kudu_partition_spec applicable here? -->
+<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> 
| <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph 
rev="kudu"><varname>kudu_partition_spec</varname></ph>
+
+<varname>simple_partition_spec</varname> ::= 
<varname>partition_col</varname>=<varname>constant_value</varname>
+
+<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= 
<varname>comparison_expression_on_partition_col</varname></ph>
 </codeblock>
 
     <p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>
@@ -108,6 +113,75 @@ COMPUTE INCREMENTAL STATS 
[<varname>db_name</varname>.]<varname>table_name</varn
       </ul>
     </note>
 
+    <p rev="IMPALA-1654">
+      <b>Computing stats for groups of partitions:</b>
+    </p>
+
+    <p rev="IMPALA-1654">
+      In <keyword keyref="impala28_full"/> and higher, you can run 
<codeph>COMPUTE INCREMENTAL STATS</codeph>
+      on multiple partitions, instead of the entire table or one partition at 
a time. You include
+      comparison operators other than <codeph>=</codeph> in the 
<codeph>PARTITION</codeph> clause,
+      and the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement applies to 
all partitions that
+      match the comparison expression.
+    </p>
+
+    <p rev="IMPALA-1654">
+      For example, the <codeph>INT_PARTITIONS</codeph> table contains 4 
partitions.
+      The following <codeph>COMPUTE INCREMENTAL STATS</codeph> statements 
affect some but not all
+      partitions, as indicated by the <codeph>Updated <varname>n</varname> 
partition(s)</codeph>
+      messages. The partitions that are affected depend on values in the 
partition key column <codeph>X</codeph>
+      that match the comparison expression in the <codeph>PARTITION</codeph> 
clause.
+    </p>
+
+<codeblock rev="IMPALA-1654"><![CDATA[
+show partitions int_partitions;
++-------+-------+--------+------+--------------+-------------------+---------+...
+| x     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format  
|...
++-------+-------+--------+------+--------------+-------------------+---------+...
+| 99    | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | PARQUET 
|...
+| 120   | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT    
|...
+| 150   | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT    
|...
+| 200   | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT    
|...
+| Total | -1    | 0      | 0B   | 0B           |                   |         
|...
++-------+-------+--------+------+--------------+-------------------+---------+...
+
+compute incremental stats int_partitions partition (x < 100);
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 1 partition(s) and 1 column(s). |
++-----------------------------------------+
+
+compute incremental stats int_partitions partition (x in (100, 150, 200));
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 2 partition(s) and 1 column(s). |
++-----------------------------------------+
+
+compute incremental stats int_partitions partition (x between 100 and 175);
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 2 partition(s) and 1 column(s). |
++-----------------------------------------+
+
+compute incremental stats int_partitions partition (x in (100, 150, 200) or x 
< 100);
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 3 partition(s) and 1 column(s). |
++-----------------------------------------+
+
+compute incremental stats int_partitions partition (x != 150);
++-----------------------------------------+
+| summary                                 |
++-----------------------------------------+
+| Updated 3 partition(s) and 1 column(s). |
++-----------------------------------------+
+]]>
+</codeblock>
+
     <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
 
     <p rev="2.3.0">

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_known_issues.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_known_issues.xml 
b/docs/topics/impala_known_issues.xml
index 72c9b12..a910ef0 100644
--- a/docs/topics/impala_known_issues.xml
+++ b/docs/topics/impala_known_issues.xml
@@ -248,6 +248,24 @@ https://issues.cloudera.org/browse/IMPALA-2144 - Don't have
 
     </conbody>
 
+    <concept id="IMPALA-4106" rev="IMPALA-4106">
+      <title>Use Hive Metastore bulk API for dropping multiple 
partitions.</title>
+      <conbody>
+        <p>
+          The bulk partition dropping and setting feature of IMPALA-1654 is 
not as efficient
+          as it could be, because it currently does not use the Hive Metastore 
bulk API.
+        </p>
+        <p><b>Bug:</b> <xref keyref="IMPALA-4106">IMPALA-4106</xref></p>
+        <p><b>Severity:</b> High</p>
+        <p><b>Workaround:</b> Schedule <codeph>ALTER TABLE</codeph> operations 
that touch
+          many partitions for times when the table is not undergoing any other 
DDL operations,
+          and be prepared for the table to be locked for some time while the 
<codeph>ALTER TABLE</codeph>
+          is in progress. Test the performance of large-scale partition 
operations in a development
+          environment before trying on tables in a production system.
+        </p>
+      </conbody>
+    </concept>
+
     <concept id="IMPALA-1480" rev="IMPALA-1480">
 
 <!-- Not part of Alex's spreadsheet. Spreadsheet has IMPALA-1423 which 
mentions it's similar to this one but not a duplicate. -->

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/eaefbb90/docs/topics/impala_show.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_show.xml b/docs/topics/impala_show.xml
index c5a7740..84e9c0b 100644
--- a/docs/topics/impala_show.xml
+++ b/docs/topics/impala_show.xml
@@ -49,7 +49,7 @@ SHOW TABLES [IN <varname>database_name</varname>] [[LIKE] 
'<varname>pattern</var
 <ph rev="1.2.1">SHOW TABLE STATS 
[<varname>database_name</varname>.]<varname>table_name</varname></ph>
 <ph rev="1.2.1">SHOW COLUMN STATS 
[<varname>database_name</varname>.]<varname>table_name</varname></ph>
 <ph rev="1.4.0">SHOW PARTITIONS 
[<varname>database_name</varname>.]<varname>table_name</varname></ph>
-<ph rev="2.2.0">SHOW FILES IN 
[<varname>database_name</varname>.]<varname>table_name</varname> [PARTITION 
(<varname>key_col</varname>=<varname>value</varname> [, 
<varname>key_col</varname>=<varname>value</varname>]]</ph>
+SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> 
<ph  rev="IMPALA-1654">[PARTITION (<varname>key_col_expression</varname> [, 
<varname>key_col_expression</varname>]</ph>]
 
 <ph rev="2.0.0">SHOW ROLES
 SHOW CURRENT ROLES
@@ -113,6 +113,19 @@ show tables '*dim*|*fact*';</codeblock>
         <codeph>MB</codeph> for megabytes, and <codeph>GB</codeph> for 
gigabytes.
       </p>
 
+      <p rev="IMPALA-1654">
+        In <keyword keyref="impala28_full"/> and higher, you can use general
+        expressions with operators such as <codeph>&lt;</codeph>, 
<codeph>IN</codeph>,
+        <codeph>LIKE</codeph>, and <codeph>BETWEEN</codeph> in the 
<codeph>PARTITION</codeph>
+        clause, instead of only equality operators. For example:
+<codeblock><![CDATA[
+show files in sample_table partition (j < 5);
+show files in sample_table partition (k = 3, l between 1 and 10);
+show files in sample_table partition (month like 'J%');
+]]>
+</codeblock>
+      </p>
+
       <note>
         This statement applies to tables and partitions stored on HDFS, or in 
the Amazon Simple Storage System (S3).
         It does not apply to views.

Reply via email to