IMPALA-5359: [DOCS] Document SORT BY syntax for CREATE TABLE and ALTER TABLE
Review part 1: Just tackling the CREATE TABLE syntax and examples for now. Will have separate gerrits for ALTER TABLE and more far-flung places to mention this feature (New Features, Performance Considerations, Parquet File Format). Change-Id: Icd571cd8840368edb327d16d27192458838ef234 Reviewed-on: http://gerrit.cloudera.org:8080/6981 Reviewed-by: Lars Volker <[email protected]> Tested-by: Impala Public Jenkins Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/a0e3a06f Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/a0e3a06f Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/a0e3a06f Branch: refs/heads/master Commit: a0e3a06fc7ccfb555e86492dc49ce1511192b984 Parents: 95c74b7 Author: John Russell <[email protected]> Authored: Wed May 24 14:18:18 2017 -0700 Committer: Impala Public Jenkins <[email protected]> Committed: Fri Jun 2 15:35:19 2017 +0000 ---------------------------------------------------------------------- docs/topics/impala_create_table.xml | 80 ++++++++++++++++++++++++++++++++ 1 file changed, 80 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/a0e3a06f/docs/topics/impala_create_table.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_create_table.xml b/docs/topics/impala_create_table.xml index 2552392..14770a9 100644 --- a/docs/topics/impala_create_table.xml +++ b/docs/topics/impala_create_table.xml @@ -90,6 +90,7 @@ under the License. [, ...] ) [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)] + <ph rev="2.9.0 IMPALA-4166">[SORT BY ([<varname>column</varname> [, <varname>column</varname> ...]])]</ph> [COMMENT '<varname>table_comment</varname>'] [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] [ @@ -106,6 +107,7 @@ under the License. <codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <varname>db_name</varname>.]<varname>table_name</varname> <ph rev="2.5.0">[PARTITIONED BY (<varname>col_name</varname>[, ...])]</ph> + <ph rev="2.9.0 IMPALA-4166">[SORT BY ([<varname>column</varname> [, <varname>column</varname> ...]])]</ph> [COMMENT '<varname>table_comment</varname>'] [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] [ @@ -164,6 +166,7 @@ file_format: <codeblock>CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [<varname>db_name</varname>.]<varname>table_name</varname> LIKE PARQUET '<varname>hdfs_path_of_parquet_file</varname>' + <ph rev="2.9.0 IMPALA-4166">[SORT BY ([<varname>column</varname> [, <varname>column</varname> ...]])]</ph> [COMMENT '<varname>table_comment</varname>'] [PARTITIONED BY (<varname>col_name</varname> <varname>data_type</varname> [COMMENT '<varname>col_comment</varname>'], ...)] [WITH SERDEPROPERTIES ('<varname>key1</varname>'='<varname>value1</varname>', '<varname>key2</varname>'='<varname>value2</varname>', ...)] @@ -375,6 +378,83 @@ AS SELECT</codeph> syntax variation. </p> + <p rev="2.9.0 IMPALA-4166"> + <b>Sorted tables (SORT BY clause):</b> + </p> + + <p rev="2.9.0 IMPALA-4166"> + The optional <codeph>SORT BY</codeph> clause lets you specify zero or more columns + that are sorted in the data files created by each Impala <codeph>INSERT</codeph> or + <codeph>CREATE TABLE AS SELECT</codeph> operation. Creating data files that are + sorted is most useful for Parquet tables, where the metadata stored inside each file includes + the minimum and maximum values for each column in the file. (The statistics apply to each row group + within the file; for simplicity, Impala writes a single row group in each file.) Grouping + data values together in relatively narrow ranges within each data file makes it possible + for Impala to quickly skip over data files that do not contain value ranges indicated in + the <codeph>WHERE</codeph> clause of a query, and can improve the effectiveness + of Parquet encoding and compression. + </p> + + <p rev="2.9.0 IMPALA-4166"> + This clause is not applicable for Kudu tables or HBase tables. Although it works + for other HDFS file formats besides Parquet, the more efficient layout is most + evident with Parquet tables, because each Parquet data file includes statistics + about the data values in that file. + </p> + + <p rev="2.9.0 IMPALA-4166"> + The <codeph>SORT BY</codeph> columns cannot include any partition key columns + for a partitioned table, because those column values are not represented in + the underlying data files. + </p> + + <p rev="2.9.0 IMPALA-4166"> + Because data files can arrive in Impala tables by mechanisms that do not respect + the <codeph>SORT BY</codeph> clause, such as <codeph>LOAD DATA</codeph> or ETL + tools that create HDFS files, Impala does not guarantee or rely on the data being + sorted. The sorting aspect is only used to create a more efficient layout for + Parquet files generated by Impala, which helps to optimize the processing of + those Parquet files during Impala queries. During an <codeph>INSERT</codeph> + or <codeph>CREATE TABLE AS SELECT</codeph> operation, the sorting occurs + when the <codeph>SORT BY</codeph> clause applies to the destination table + for the data, regardless of whether the source table has a <codeph>SORT BY</codeph> + clause. + </p> + + <p rev="2.9.0 IMPALA-4166"> + For example, when creating a table intended to contain census data, you might define + sort columns such as last name and state. If a data file in this table contains a + narrow range of last names, for example from <codeph>Smith</codeph> to <codeph>Smythe</codeph>, + Impala can quickly detect that this data file contains no matches for a <codeph>WHERE</codeph> + clause such as <codeph>WHERE last_name = 'Jones'</codeph> and avoid reading the entire file. + </p> + +<codeblock rev="2.9.0 IMPALA-4166">CREATE TABLE census_data (last_name STRING, first_name STRING, state STRING, address STRING) + SORT BY (last_name, state) + STORED AS PARQUET; +</codeblock> + + <p rev="2.9.0 IMPALA-4166"> + Likewise, if an existing table contains data without any sort order, you can reorganize + the data in a more efficient way by using <codeph>INSERT</codeph> or + <codeph>CREATE TABLE AS SELECT</codeph> to copy that data into a new table with a + <codeph>SORT BY</codeph> clause: + </p> + +<codeblock rev="2.9.0 IMPALA-4166">CREATE TABLE sorted_census_data + SORT BY (last_name, state) + STORED AS PARQUET + AS SELECT last_name, first_name, state, address + FROM unsorted_census_data; +</codeblock> + + <p rev="2.9.0 IMPALA-4166"> + The metadata for the <codeph>SORT BY</codeph> clause is stored in the <codeph>TBLPROPERTIES</codeph> + fields for the table. Other SQL engines that can interoperate with Impala tables, such as Hive + and Spark SQL, do not recognize this property when inserting into a table that has a <codeph>SORT BY</codeph> + clause. + </p> + <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/> <p rev="kudu">
