This is an automated email from the ASF dual-hosted git repository. arodoni pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/impala.git
commit 9a3633c75f84601e7b8d05522daa6a9e0cf66a53 Author: Alex Rodoni <arod...@cloudera.com> AuthorDate: Thu Feb 7 11:38:43 2019 -0800 [DOCS] Format fixes in the TABLESAMPLE doc Change-Id: I2304e422636d03d7d1109d3f0a551d6f04ce1a63 Reviewed-on: http://gerrit.cloudera.org:8080/12392 Reviewed-by: Alex Rodoni <arod...@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> --- docs/topics/impala_tablesample.xml | 362 +++++++++++++++++-------------------- 1 file changed, 165 insertions(+), 197 deletions(-) diff --git a/docs/topics/impala_tablesample.xml b/docs/topics/impala_tablesample.xml index e5123cb..67372fa 100644 --- a/docs/topics/impala_tablesample.xml +++ b/docs/topics/impala_tablesample.xml @@ -21,6 +21,7 @@ under the License. <concept id="tablesample" rev="IMPALA-5309"> <title>TABLESAMPLE Clause</title> + <prolog> <metadata> <data name="Category" value="Impala"/> @@ -34,263 +35,250 @@ under the License. <conbody> <p> - Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need - to explore the data distribution within the table, the table is very large, - and it is impractical or unnecessary to process all the data from the table - or selected partitions. + Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need to explore the + data distribution within the table, the table is very large, and it is impractical or + unnecessary to process all the data from the table or selected partitions. </p> <p> - The clause makes the query process a randomized set of data files from the - table, so that the total volume of data is greater than or equal to the specified - percentage of data bytes within that table. (Or the data bytes within the set of - partitions that remain after partition pruning is performed.) + The clause makes the query process a randomized set of data files from the table, so that + the total volume of data is greater than or equal to the specified percentage of data + bytes within that table. (Or the data bytes within the set of partitions that remain after + partition pruning is performed.) </p> <p conref="../shared/impala_common.xml#common/syntax_blurb"/> <codeblock> - <ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph> +<ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph> </codeblock> <p> - The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table alias. + The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table + alias. </p> <p> - The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently, - Impala only supports a single sampling method named <codeph>SYSTEM</codeph>. + The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently, Impala only + supports a single sampling method named <codeph>SYSTEM</codeph>. </p> <p> - The <varname>percentage</varname> argument is an integer literal from 0 to 100. - A percentage of 0 produces an empty result set for a particular table reference, - while a percentage of 100 uses the entire contents. Because the sampling works by - selecting a random set of data files, the proportion of sampled data from the - table may be greater than the specified percentage, based on the number and sizes - of the underlying data files. See the usage notes for details. + The <varname>percentage</varname> argument is an integer literal from 0 to 100. A + percentage of 0 produces an empty result set for a particular table reference, while a + percentage of 100 uses the entire contents. Because the sampling works by selecting a + random set of data files, the proportion of sampled data from the table may be greater + than the specified percentage, based on the number and sizes of the underlying data files. + See the usage notes for details. </p> <p> - The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary - positive integer seed value that ensures that when the query is run again, the - sampling selects the same set of data files each time. <codeph>REPEATABLE</codeph> - does not have a default value. If you omit the <codeph>REPEATABLE</codeph> keyword, - the random seed is derived from the current time. + The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary positive + integer seed value that ensures that when the query is run again, the sampling selects the + same set of data files each time. <codeph>REPEATABLE</codeph> does not have a default + value. If you omit the <codeph>REPEATABLE</codeph> keyword, the random seed is derived + from the current time. </p> <p conref="../shared/impala_common.xml#common/added_in_290"/> <p rev="2.12.0 IMPALA-5310"> - See <keyword keyref="compute_stats"/> for the - <codeph>TABLESAMPLE</codeph> clause used in the <codeph>COMPUTE - STATS</codeph> statement. + See <keyword keyref="compute_stats"/> for the <codeph>TABLESAMPLE</codeph> clause used in + the <codeph>COMPUTE STATS</codeph> statement. </p> <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> <p> - You might use this clause with aggregation queries, such as finding - the approximate average, minimum, or maximum where exact precision - is not required. You can use these findings to plan the most effective - strategy for constructing queries against the full table or designing - a partitioning strategy for the data. + You might use this clause with aggregation queries, such as finding the approximate + average, minimum, or maximum where exact precision is not required. You can use these + findings to plan the most effective strategy for constructing queries against the full + table or designing a partitioning strategy for the data. </p> <p> - Some other database systems have a <codeph>TABLESAMPLE</codeph> clause. - The Impala syntax for this clause is modeled on the syntax for popular - relational databases, not the Hive <codeph>TABLESAMPLE</codeph> clause. - For example, there is no <codeph>BUCKETS</codeph> keyword as in HiveQL. + Some other database systems have a <codeph>TABLESAMPLE</codeph> clause. The Impala syntax + for this clause is modeled on the syntax for popular relational databases, not the Hive + <codeph>TABLESAMPLE</codeph> clause. For example, there is no <codeph>BUCKETS</codeph> + keyword as in HiveQL. </p> <p> - The precision of the <varname>percentage</varname> threshold depends on - the number and sizes of the underlying data files. Impala brings in - additional data files, one at a time, until the number of bytes exceeds - the specified percentage based on the total number of bytes for the - entire set of table data. The precision of the percentage threshold is higher - when the table contains many data files with consistent sizes. See the - code listings later in this section for examples. + The precision of the <varname>percentage</varname> threshold depends on the number and + sizes of the underlying data files. Impala brings in additional data files, one at a time, + until the number of bytes exceeds the specified percentage based on the total number of + bytes for the entire set of table data. The precision of the percentage threshold is + higher when the table contains many data files with consistent sizes. See the code + listings later in this section for examples. </p> <p> - When you estimate characteristics of the data distribution based on sampling - a percentage of the table data, be aware that the data might be unevenly distributed - between different files. Do not assume that the percentage figure reflects the - percentage of rows in the table. For example, one file might contain all blank values - for a <codeph>STRING</codeph> column, while another file contains long strings - in that column; therefore, one file could contain many more rows than another. - Likewise, a table created with the <codeph>SORT BY</codeph> clause might - contain narrow ranges of values for the sort columns, making it impractical to - extrapolate the number of distinct values for those columns based on sampling - only some of the data files. + When you estimate characteristics of the data distribution based on sampling a percentage + of the table data, be aware that the data might be unevenly distributed between different + files. Do not assume that the percentage figure reflects the percentage of rows in the + table. For example, one file might contain all blank values for a <codeph>STRING</codeph> + column, while another file contains long strings in that column; therefore, one file could + contain many more rows than another. Likewise, a table created with the <codeph>SORT + BY</codeph> clause might contain narrow ranges of values for the sort columns, making it + impractical to extrapolate the number of distinct values for those columns based on + sampling only some of the data files. </p> <p> - Because a sample of the table data might not contain all values for a particular - column, if the <codeph>TABLESAMPLE</codeph> is used in a join query, the - key relationships between the tables might produce incomplete result sets - compared to joins using all the table data. For example, if you join 50% - of table A with 50% of table B, some values in the join columns might - not match between the two tables, even though overall there is a 1:1 + Because a sample of the table data might not contain all values for a particular column, + if the <codeph>TABLESAMPLE</codeph> is used in a join query, the key relationships between + the tables might produce incomplete result sets compared to joins using all the table + data. For example, if you join 50% of table A with 50% of table B, some values in the join + columns might not match between the two tables, even though overall there is a 1:1 relationship between the tables. </p> <p> - The <codeph>REPEATABLE</codeph> keyword makes identical queries use a - consistent set of data files when the query is repeated. You specify an - arbitrary integer key that acts as a seed value when Impala randomly - selects the set of data files to use in the query. This technique - lets you verify correctness, examine performance, and so on for queries - using the <codeph>TABLESAMPLE</codeph> clause without the sampled data - being different each time. The repeatable aspect is reset (that is, the - set of selected data files may change) any time the contents of the table - change. The statements or operations that can make sampling results - non-repeatable are: + The <codeph>REPEATABLE</codeph> keyword makes identical queries use a consistent set of + data files when the query is repeated. You specify an arbitrary integer key that acts as a + seed value when Impala randomly selects the set of data files to use in the query. This + technique lets you verify correctness, examine performance, and so on for queries using + the <codeph>TABLESAMPLE</codeph> clause without the sampled data being different each + time. The repeatable aspect is reset (that is, the set of selected data files may change) + any time the contents of the table change. The statements or operations that can make + sampling results non-repeatable are: </p> <ul> <li> <codeph>INSERT</codeph>. </li> + <li> <codeph>TRUNCATE TABLE</codeph>. </li> + <li> <codeph>LOAD DATA</codeph>. </li> + <li> - <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> - after files are added or removed by a non-Impala mechanism. - </li> - <li> + <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> after files are added + or removed by a non-Impala mechanism. </li> + + <li></li> </ul> <p> - This clause is similar in some ways to the <codeph>LIMIT</codeph> clause, - because both serve to limit the size of the intermediate data and final - result set. <codeph>LIMIT 0</codeph> is more efficient than - <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying that a query can execute - without producing any results. <codeph>TABLESAMPLE SYSTEM(<varname>n</varname>)</codeph> - often makes query processing more efficient than using a <codeph>LIMIT</codeph> clause - by itself, because all phases of query execution use less data overall. - If the intent is to retrieve some representative values from the table - in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>, - <codeph>ORDER BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query. + This clause is similar in some ways to the <codeph>LIMIT</codeph> clause, because both + serve to limit the size of the intermediate data and final result set. <codeph>LIMIT + 0</codeph> is more efficient than <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying + that a query can execute without producing any results. <codeph>TABLESAMPLE + SYSTEM(<varname>n</varname>)</codeph> often makes query processing more efficient than + using a <codeph>LIMIT</codeph> clause by itself, because all phases of query execution use + less data overall. If the intent is to retrieve some representative values from the table + in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>, <codeph>ORDER + BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query. </p> <p conref="../shared/impala_common.xml#common/partitioning_blurb"/> + <p> - When you query a partitioned table, any partition pruning happens - before Impala selects the data files to sample. For example, in a - table partitioned by year, a query with <codeph>WHERE year = 2017</codeph> - and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause would sample - data files representing at least 10% of the bytes present in the - 2017 partition. + When you query a partitioned table, any partition pruning happens before Impala selects + the data files to sample. For example, in a table partitioned by year, a query with + <codeph>WHERE year = 2017</codeph> and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause + would sample data files representing at least 10% of the bytes present in the 2017 + partition. </p> <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p> - This clause applies to S3 tables the same way as tables - with data files stored on HDFS. + This clause applies to S3 tables the same way as tables with data files stored on HDFS. </p> <p conref="../shared/impala_common.xml#common/adls_blurb"/> + <p> - This clause applies to ADLS tables the same way as tables - with data files stored on HDFS. + This clause applies to ADLS tables the same way as tables with data files stored on HDFS. </p> <p conref="../shared/impala_common.xml#common/kudu_blurb"/> + <p> This clause does not apply to Kudu tables. </p> <p conref="../shared/impala_common.xml#common/hbase_blurb"/> + <p> This clause does not apply to HBase tables. </p> <p conref="../shared/impala_common.xml#common/performance_blurb"/> + <p> - From a performance perspective, the <codeph>TABLESAMPLE</codeph> - clause is especially valuable for exploratory queries on - text, Avro, or other file formats other than Parquet. Text-based - or row-oriented file formats must process substantial amounts of - redundant data for queries that derive aggregate results such as - <codeph>MAX()</codeph>, <codeph>MIN()</codeph>, or <codeph>AVG()</codeph> - for a single column. Therefore, you might use <codeph>TABLESAMPLE</codeph> - early in the ETL pipeline, when data is still in raw text format - and has not been converted to Parquet or moved into a partitioned - table. + From a performance perspective, the <codeph>TABLESAMPLE</codeph> clause is especially + valuable for exploratory queries on text, Avro, or other file formats other than Parquet. + Text-based or row-oriented file formats must process substantial amounts of redundant data + for queries that derive aggregate results such as <codeph>MAX()</codeph>, + <codeph>MIN()</codeph>, or <codeph>AVG()</codeph> for a single column. Therefore, you + might use <codeph>TABLESAMPLE</codeph> early in the ETL pipeline, when data is still in + raw text format and has not been converted to Parquet or moved into a partitioned table. </p> <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> <p> - This clause applies only to tables that use a storage layer - with underlying raw data files, such as HDFS, Amazon S3, - or Microsoft ADLS. + This clause applies only to tables that use a storage layer with underlying raw data + files, such as HDFS, Amazon S3, or Microsoft ADLS. </p> <p> - This clause does not apply to table references that represent views. - A query that applies the <codeph>TABLESAMPLE</codeph> clause to a - view or a subquery fails with a semantic error. + This clause does not apply to table references that represent views. A query that applies + the <codeph>TABLESAMPLE</codeph> clause to a view or a subquery fails with a semantic + error. </p> <p> - Because the sampling works at the level of entire data files, it - is by nature coarse-grained. It is possible to specify a small - sample percentage but still process a substantial portion of the - table data if the table contains relatively few data files, if - each data file is very large, or if the data files vary substantially - in size. Be sure that you understand the data distribution and physical - file layout so that you can verify if the results are suitable for - extrapolation. For example, if the table contains only a single data file, - the <q>sample</q> will consist of all the table data regardless of - the percentage you specify. If the table contains data files of - 1 GiB, 1 GiB, and 1 KiB, when you specify a sampling percentage of - 50 you would either process slightly more than 50% of the table - (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), - depending on which data files were selected for sampling. + Because the sampling works at the level of entire data files, it is by nature + coarse-grained. It is possible to specify a small sample percentage but still process a + substantial portion of the table data if the table contains relatively few data files, if + each data file is very large, or if the data files vary substantially in size. Be sure + that you understand the data distribution and physical file layout so that you can verify + if the results are suitable for extrapolation. For example, if the table contains only a + single data file, the <q>sample</q> will consist of all the table data regardless of the + percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when + you specify a sampling percentage of 50 you would either process slightly more than 50% of + the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which + data files were selected for sampling. </p> <p> - If data files are added by a non-Impala mechanism, and the - table metadata is not updated by a <codeph>REFRESH</codeph> - or <codeph>INVALIDATE METADATA</codeph> statement, the - <codeph>TABLESAMPLE</codeph> clause does not consider those - new files when computing the number of bytes in the table - or selecting which files to sample. + If data files are added by a non-Impala mechanism, and the table metadata is not updated + by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the + <codeph>TABLESAMPLE</codeph> clause does not consider those new files when computing the + number of bytes in the table or selecting which files to sample. </p> <p> - If data files are removed by a non-Impala mechanism, and the - table metadata is not updated by a <codeph>REFRESH</codeph> - or <codeph>INVALIDATE METADATA</codeph> statement, the - query fails if the <codeph>TABLESAMPLE</codeph> clause - attempts to reference any of the missing files. + If data files are removed by a non-Impala mechanism, and the table metadata is not updated + by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the query + fails if the <codeph>TABLESAMPLE</codeph> clause attempts to reference any of the missing + files. </p> <p conref="../shared/impala_common.xml#common/example_blurb"/> <p> - The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause. - These examples intentionally use very small data sets to illustrate how - the number of files, size of each file, and overall size of data in the table - interact with the percentage specified in the clause. + The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause. These examples + intentionally use very small data sets to illustrate how the number of files, size of each + file, and overall size of data in the table interact with the percentage specified in the + clause. </p> <p> - These examples use an unpartitioned table, containing several files of roughly - the same size: + These examples use an unpartitioned table, containing several files of roughly the same + size: </p> -<codeblock><![CDATA[ -create table sample_demo (x int, s string); +<codeblock>create table sample_demo (x int, s string); insert into sample_demo values (1, 'one'); insert into sample_demo values (2, 'two'); @@ -314,20 +302,16 @@ show table stats sample_demo; | #Rows | #Files | Size | Format | Location | +-------+--------+------+--------+-------------------------+ | -1 | 5 | 34B | TEXT | /tsample.db/sample_demo | -+-------+--------+------+--------+-------------------------+ -</codeblock> ++-------+--------+------+--------+-------------------------+</codeblock> <p> - A query that samples 50% of the table must process at least - 17 bytes of data. Based on the sizes of the data files, - we can predict that each such query uses 3 arbitrary files. - Any 1 or 2 files are not enough to reach 50% of the total - data in the table (34 bytes), so the query adds more files - until it passes the 50% threshold: + A query that samples 50% of the table must process at least 17 bytes of data. Based on the + sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1 + or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the + query adds more files until it passes the 50% threshold: </p> -<codeblock><![CDATA[ -select distinct x from sample_demo tablesample system(50); +<codeblock>select distinct x from sample_demo tablesample system(50); +---+ | x | +---+ @@ -352,19 +336,16 @@ select distinct x from sample_demo tablesample system(50); | 5 | | 3 | | 2 | -+---+ -</codeblock> ++---+</codeblock> <p> - To help run reproducible experiments, the <codeph>REPEATABLE</codeph> - clause causes Impala to choose the same set of files for each query. - Although the data set being considered is deterministic, the order - of results varies (in the absence of an <codeph>ORDER BY</codeph> + To help run reproducible experiments, the <codeph>REPEATABLE</codeph> clause causes Impala + to choose the same set of files for each query. Although the data set being considered is + deterministic, the order of results varies (in the absence of an <codeph>ORDER BY</codeph> clause) because of the way distributed queries are processed: </p> -<codeblock><![CDATA[ -select distinct x from sample_demo +<codeblock>select distinct x from sample_demo tablesample system(50) repeatable (12345); +---+ | x | @@ -386,17 +367,13 @@ select distinct x from sample_demo </codeblock> <p> - The following examples show how uneven data distribution affects - which data is sampled. Adding another data file containing a long - string value changes the threshold for 50% of the total data in - the table: + The following examples show how uneven data distribution affects which data is sampled. + Adding another data file containing a long string value changes the threshold for 50% of + the total data in the table: </p> -<codeblock><![CDATA[ -insert into sample_demo values (1000, 'Boyhood is the longest time in li -fe for a boy. The last term of the school-year is made of decades, not o -f weeks, and living through them is like waiting for the millennium. Boo -th Tarkington'); +<codeblock>insert into sample_demo values +(1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington'); show files in sample_demo; +---------------------+------+-----------+ @@ -419,17 +396,15 @@ show table stats sample_demo; </codeblock> <p> - Even though the queries do not refer to the <codeph>S</codeph> - column containing the long value, all the sampling queries include - the data file containing the column value <codeph>X=1000</codeph>, - because the query cannot reach the 50% threshold (115 bytes) without - including that file. The large file might be considered first, in which - case it is the only file processed by the query. Or an arbitrary - set of other files might be considered first. + Even though the queries do not refer to the <codeph>S</codeph> column containing the long + value, all the sampling queries include the data file containing the column value + <codeph>X=1000</codeph>, because the query cannot reach the 50% threshold (115 bytes) + without including that file. The large file might be considered first, in which case it is + the only file processed by the query. Or an arbitrary set of other files might be + considered first. </p> -<codeblock><![CDATA[ -select distinct x from sample_demo tablesample system(50); +<codeblock>select distinct x from sample_demo tablesample system(50); +------+ | x | +------+ @@ -453,17 +428,14 @@ select distinct x from sample_demo tablesample system(50); | 4 | | 2 | | 1 | -+------+ -</codeblock> ++------+</codeblock> <p> - The following examples demonstrate how the <codeph>TABLESAMPLE</codeph> - clause interacts with other table aspects, such as partitioning and file - format: + The following examples demonstrate how the <codeph>TABLESAMPLE</codeph> clause interacts + with other table aspects, such as partitioning and file format: </p> -<codeblock><![CDATA[ -create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet; +<codeblock>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet; insert into sample_demo_partitions partition (n = 1) select * from sample_demo; insert into sample_demo_partitions partition (n = 2) select * from sample_demo; @@ -493,12 +465,11 @@ show table stats tablesample_demo_partitioned; </codeblock> <p> - If the query does not involve any partition pruning, the - sampling applies to the data volume of the entire table: + If the query does not involve any partition pruning, the sampling applies to the data + volume of the entire table: </p> -<codeblock><![CDATA[ --- 18 rows total. +<codeblock>-- 18 rows total. select count(*) from sample_demo_partitions; +----------+ | count(*) | @@ -524,18 +495,14 @@ select count(*) from sample_demo_partitions | count(*) | +----------+ | 16 | -+----------+ -</codeblock> ++----------+</codeblock> <p> - If the query only processes certain partitions, - the query computes the sampling threshold based on - the data size and set of files only from the - relevant partitions: + If the query only processes certain partitions, the query computes the sampling threshold + based on the data size and set of files only from the relevant partitions: </p> -<codeblock><![CDATA[ -select count(*) from sample_demo_partitions +<codeblock>select count(*) from sample_demo_partitions tablesample system(50) where n = 1; +----------+ | count(*) | @@ -550,13 +517,14 @@ select count(*) from sample_demo_partitions +----------+ | 2 | +----------+ -]]> </codeblock> <p conref="../shared/impala_common.xml#common/related_info"/> + <p> <xref href="impala_select.xml#select"/> </p> </conbody> + </concept>