IMPALA-5359: [DOCS] Document SORT BY syntax for CREATE TABLE and ALTER TABLE

Review part 1: Just tackling the CREATE TABLE syntax and examples for
now. Will have separate gerrits for ALTER TABLE and more far-flung
places to mention this feature (New Features, Performance
Considerations, Parquet File Format).

Change-Id: Icd571cd8840368edb327d16d27192458838ef234
Reviewed-on: http://gerrit.cloudera.org:8080/6981
Reviewed-by: Lars Volker <[email protected]>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/a0e3a06f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/a0e3a06f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/a0e3a06f

Branch: refs/heads/master
Commit: a0e3a06fc7ccfb555e86492dc49ce1511192b984
Parents: 95c74b7
Author: John Russell <[email protected]>
Authored: Wed May 24 14:18:18 2017 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Fri Jun 2 15:35:19 2017 +0000

----------------------------------------------------------------------
 docs/topics/impala_create_table.xml | 80 ++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/a0e3a06f/docs/topics/impala_create_table.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_create_table.xml 
b/docs/topics/impala_create_table.xml
index 2552392..14770a9 100644
--- a/docs/topics/impala_create_table.xml
+++ b/docs/topics/impala_create_table.xml
@@ -90,6 +90,7 @@ under the License.
     [, ...]
   )
   [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> 
[COMMENT '<varname>col_comment</varname>'], ...)]
+  <ph rev="2.9.0 IMPALA-4166">[SORT BY ([<varname>column</varname> [, 
<varname>column</varname> ...]])]</ph>
   [COMMENT '<varname>table_comment</varname>']
   [WITH SERDEPROPERTIES 
('<varname>key1</varname>'='<varname>value1</varname>', 
'<varname>key2</varname>'='<varname>value2</varname>', ...)]
   [
@@ -106,6 +107,7 @@ under the License.
 
 <codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] 
<varname>db_name</varname>.]<varname>table_name</varname>
   <ph rev="2.5.0">[PARTITIONED BY (<varname>col_name</varname>[, ...])]</ph>
+  <ph rev="2.9.0 IMPALA-4166">[SORT BY ([<varname>column</varname> [, 
<varname>column</varname> ...]])]</ph>
   [COMMENT '<varname>table_comment</varname>']
   [WITH SERDEPROPERTIES 
('<varname>key1</varname>'='<varname>value1</varname>', 
'<varname>key2</varname>'='<varname>value2</varname>', ...)]
   [
@@ -164,6 +166,7 @@ file_format:
 
 <codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] 
[<varname>db_name</varname>.]<varname>table_name</varname>
   LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>'
+  <ph rev="2.9.0 IMPALA-4166">[SORT BY ([<varname>column</varname> [, 
<varname>column</varname> ...]])]</ph>
   [COMMENT '<varname>table_comment</varname>']
   [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> 
[COMMENT '<varname>col_comment</varname>'], ...)]
   [WITH SERDEPROPERTIES 
('<varname>key1</varname>'='<varname>value1</varname>', 
'<varname>key2</varname>'='<varname>value2</varname>', ...)]
@@ -375,6 +378,83 @@ AS
       SELECT</codeph> syntax variation.
     </p>
 
+    <p rev="2.9.0 IMPALA-4166">
+      <b>Sorted tables (SORT BY clause):</b>
+    </p>
+
+    <p rev="2.9.0 IMPALA-4166">
+      The optional <codeph>SORT BY</codeph> clause lets you specify zero or 
more columns
+      that are sorted in the data files created by each Impala 
<codeph>INSERT</codeph> or
+      <codeph>CREATE TABLE AS SELECT</codeph> operation. Creating data files 
that are
+      sorted is most useful for Parquet tables, where the metadata stored 
inside each file includes
+      the minimum and maximum values for each column in the file. (The 
statistics apply to each row group
+      within the file; for simplicity, Impala writes a single row group in 
each file.) Grouping
+      data values together in relatively narrow ranges within each data file 
makes it possible
+      for Impala to quickly skip over data files that do not contain value 
ranges indicated in
+      the <codeph>WHERE</codeph> clause of a query, and can improve the 
effectiveness
+      of Parquet encoding and compression.
+    </p>
+
+    <p rev="2.9.0 IMPALA-4166">
+      This clause is not applicable for Kudu tables or HBase tables. Although 
it works
+      for other HDFS file formats besides Parquet, the more efficient layout 
is most
+      evident with Parquet tables, because each Parquet data file includes 
statistics
+      about the data values in that file.
+    </p>
+
+    <p rev="2.9.0 IMPALA-4166">
+      The <codeph>SORT BY</codeph> columns cannot include any partition key 
columns
+      for a partitioned table, because those column values are not represented 
in
+      the underlying data files.
+    </p>
+
+    <p rev="2.9.0 IMPALA-4166">
+      Because data files can arrive in Impala tables by mechanisms that do not 
respect
+      the <codeph>SORT BY</codeph> clause, such as <codeph>LOAD DATA</codeph> 
or ETL
+      tools that create HDFS files, Impala does not guarantee or rely on the 
data being
+      sorted. The sorting aspect is only used to create a more efficient 
layout for
+      Parquet files generated by Impala, which helps to optimize the 
processing of
+      those Parquet files during Impala queries. During an 
<codeph>INSERT</codeph>
+      or <codeph>CREATE TABLE AS SELECT</codeph> operation, the sorting occurs
+      when the <codeph>SORT BY</codeph> clause applies to the destination table
+      for the data, regardless of whether the source table has a <codeph>SORT 
BY</codeph>
+      clause.
+    </p>
+
+    <p rev="2.9.0 IMPALA-4166">
+      For example, when creating a table intended to contain census data, you 
might define
+      sort columns such as last name and state. If a data file in this table 
contains a
+      narrow range of last names, for example from <codeph>Smith</codeph> to 
<codeph>Smythe</codeph>,
+      Impala can quickly detect that this data file contains no matches for a 
<codeph>WHERE</codeph>
+      clause such as <codeph>WHERE last_name = 'Jones'</codeph> and avoid 
reading the entire file.
+    </p>
+
+<codeblock rev="2.9.0 IMPALA-4166">CREATE TABLE census_data (last_name STRING, 
first_name STRING, state STRING, address STRING)
+  SORT BY (last_name, state)
+  STORED AS PARQUET;
+</codeblock>
+
+    <p rev="2.9.0 IMPALA-4166">
+      Likewise, if an existing table contains data without any sort order, you 
can reorganize
+      the data in a more efficient way by using <codeph>INSERT</codeph> or
+      <codeph>CREATE TABLE AS SELECT</codeph> to copy that data into a new 
table with a
+      <codeph>SORT BY</codeph> clause:
+    </p>
+
+<codeblock rev="2.9.0 IMPALA-4166">CREATE TABLE sorted_census_data
+  SORT BY (last_name, state)
+  STORED AS PARQUET
+  AS SELECT last_name, first_name, state, address
+    FROM unsorted_census_data;
+</codeblock>
+
+    <p rev="2.9.0 IMPALA-4166">
+      The metadata for the <codeph>SORT BY</codeph> clause is stored in the 
<codeph>TBLPROPERTIES</codeph>
+      fields for the table. Other SQL engines that can interoperate with 
Impala tables, such as Hive
+      and Spark SQL, do not recognize this property when inserting into a 
table that has a <codeph>SORT BY</codeph>
+      clause.
+    </p>
+
     <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/>
 
     <p rev="kudu">

Reply via email to