[impala] 01/03: [DOCS] Format fixes in the TABLESAMPLE doc

arodoni Thu, 07 Feb 2019 13:50:07 -0800

This is an automated email from the ASF dual-hosted git repository.

arodoni pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git


commit 9a3633c75f84601e7b8d05522daa6a9e0cf66a53
Author: Alex Rodoni <arod...@cloudera.com>
AuthorDate: Thu Feb 7 11:38:43 2019 -0800

    [DOCS] Format fixes in the TABLESAMPLE doc
    
    Change-Id: I2304e422636d03d7d1109d3f0a551d6f04ce1a63
    Reviewed-on: http://gerrit.cloudera.org:8080/12392
    Reviewed-by: Alex Rodoni <arod...@cloudera.com>
    Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
 docs/topics/impala_tablesample.xml | 362 +++++++++++++++++--------------------
 1 file changed, 165 insertions(+), 197 deletions(-)

diff --git a/docs/topics/impala_tablesample.xml 
b/docs/topics/impala_tablesample.xml
index e5123cb..67372fa 100644
--- a/docs/topics/impala_tablesample.xml
+++ b/docs/topics/impala_tablesample.xml
@@ -21,6 +21,7 @@ under the License.
 <concept id="tablesample" rev="IMPALA-5309">
 
   <title>TABLESAMPLE Clause</title>
+
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -34,263 +35,250 @@ under the License.
   <conbody>
 
     <p>
-      Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need
-      to explore the data distribution within the table, the table is very 
large,
-      and it is impractical or unnecessary to process all the data from the 
table
-      or selected partitions.
+      Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need 
to explore the
+      data distribution within the table, the table is very large, and it is 
impractical or
+      unnecessary to process all the data from the table or selected 
partitions.
     </p>
 
     <p>
-      The clause makes the query process a randomized set of data files from 
the
-      table, so that the total volume of data is greater than or equal to the 
specified
-      percentage of data bytes within that table. (Or the data bytes within 
the set of
-      partitions that remain after partition pruning is performed.)
+      The clause makes the query process a randomized set of data files from 
the table, so that
+      the total volume of data is greater than or equal to the specified 
percentage of data
+      bytes within that table. (Or the data bytes within the set of partitions 
that remain after
+      partition pruning is performed.)
     </p>
 
     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
 
 <codeblock>
-  <ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) 
[REPEATABLE(<varname>seed</varname>)]</ph>
+<ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) 
[REPEATABLE(<varname>seed</varname>)]</ph>
 </codeblock>
 
     <p>
-      The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table 
name or table alias.
+      The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table 
name or table
+      alias.
     </p>
 
     <p>
-      The <codeph>SYSTEM</codeph> keyword represents the sampling method. 
Currently,
-      Impala only supports a single sampling method named 
<codeph>SYSTEM</codeph>.
+      The <codeph>SYSTEM</codeph> keyword represents the sampling method. 
Currently, Impala only
+      supports a single sampling method named <codeph>SYSTEM</codeph>.
     </p>
 
     <p>
-      The <varname>percentage</varname> argument is an integer literal from 0 
to 100.
-      A percentage of 0 produces an empty result set for a particular table 
reference,
-      while a percentage of 100 uses the entire contents. Because the sampling 
works by
-      selecting a random set of data files, the proportion of sampled data 
from the
-      table may be greater than the specified percentage, based on the number 
and sizes
-      of the underlying data files. See the usage notes for details.
+      The <varname>percentage</varname> argument is an integer literal from 0 
to 100. A
+      percentage of 0 produces an empty result set for a particular table 
reference, while a
+      percentage of 100 uses the entire contents. Because the sampling works 
by selecting a
+      random set of data files, the proportion of sampled data from the table 
may be greater
+      than the specified percentage, based on the number and sizes of the 
underlying data files.
+      See the usage notes for details.
     </p>
 
     <p>
-      The optional <codeph>REPEATABLE</codeph> keyword lets you specify an 
arbitrary
-      positive integer seed value that ensures that when the query is run 
again, the
-      sampling selects the same set of data files each time. 
<codeph>REPEATABLE</codeph>
-      does not have a default value. If you omit the 
<codeph>REPEATABLE</codeph> keyword,
-      the random seed is derived from the current time.
+      The optional <codeph>REPEATABLE</codeph> keyword lets you specify an 
arbitrary positive
+      integer seed value that ensures that when the query is run again, the 
sampling selects the
+      same set of data files each time. <codeph>REPEATABLE</codeph> does not 
have a default
+      value. If you omit the <codeph>REPEATABLE</codeph> keyword, the random 
seed is derived
+      from the current time.
     </p>
 
     <p conref="../shared/impala_common.xml#common/added_in_290"/>
 
     <p rev="2.12.0 IMPALA-5310">
-      See <keyword keyref="compute_stats"/> for the
-        <codeph>TABLESAMPLE</codeph> clause used in the <codeph>COMPUTE
-        STATS</codeph> statement.
+      See <keyword keyref="compute_stats"/> for the 
<codeph>TABLESAMPLE</codeph> clause used in
+      the <codeph>COMPUTE STATS</codeph> statement.
     </p>
 
     <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
 
     <p>
-      You might use this clause with aggregation queries, such as finding
-      the approximate average, minimum, or maximum where exact precision
-      is not required. You can use these findings to plan the most effective
-      strategy for constructing queries against the full table or designing
-      a partitioning strategy for the data.
+      You might use this clause with aggregation queries, such as finding the 
approximate
+      average, minimum, or maximum where exact precision is not required. You 
can use these
+      findings to plan the most effective strategy for constructing queries 
against the full
+      table or designing a partitioning strategy for the data.
     </p>
 
     <p>
-      Some other database systems have a <codeph>TABLESAMPLE</codeph> clause.
-      The Impala syntax for this clause is modeled on the syntax for popular
-      relational databases, not the Hive <codeph>TABLESAMPLE</codeph> clause.
-      For example, there is no <codeph>BUCKETS</codeph> keyword as in HiveQL.
+      Some other database systems have a <codeph>TABLESAMPLE</codeph> clause. 
The Impala syntax
+      for this clause is modeled on the syntax for popular relational 
databases, not the Hive
+      <codeph>TABLESAMPLE</codeph> clause. For example, there is no 
<codeph>BUCKETS</codeph>
+      keyword as in HiveQL.
     </p>
 
     <p>
-      The precision of the <varname>percentage</varname> threshold depends on
-      the number and sizes of the underlying data files. Impala brings in
-      additional data files, one at a time, until the number of bytes exceeds
-      the specified percentage based on the total number of bytes for the
-      entire set of table data. The precision of the percentage threshold is 
higher
-      when the table contains many data files with consistent sizes. See the
-      code listings later in this section for examples.
+      The precision of the <varname>percentage</varname> threshold depends on 
the number and
+      sizes of the underlying data files. Impala brings in additional data 
files, one at a time,
+      until the number of bytes exceeds the specified percentage based on the 
total number of
+      bytes for the entire set of table data. The precision of the percentage 
threshold is
+      higher when the table contains many data files with consistent sizes. 
See the code
+      listings later in this section for examples.
     </p>
 
     <p>
-      When you estimate characteristics of the data distribution based on 
sampling
-      a percentage of the table data, be aware that the data might be unevenly 
distributed
-      between different files. Do not assume that the percentage figure 
reflects the
-      percentage of rows in the table. For example, one file might contain all 
blank values
-      for a <codeph>STRING</codeph> column, while another file contains long 
strings
-      in that column; therefore, one file could contain many more rows than 
another.
-      Likewise, a table created with the <codeph>SORT BY</codeph> clause might
-      contain narrow ranges of values for the sort columns, making it 
impractical to
-      extrapolate the number of distinct values for those columns based on 
sampling
-      only some of the data files.
+      When you estimate characteristics of the data distribution based on 
sampling a percentage
+      of the table data, be aware that the data might be unevenly distributed 
between different
+      files. Do not assume that the percentage figure reflects the percentage 
of rows in the
+      table. For example, one file might contain all blank values for a 
<codeph>STRING</codeph>
+      column, while another file contains long strings in that column; 
therefore, one file could
+      contain many more rows than another. Likewise, a table created with the 
<codeph>SORT
+      BY</codeph> clause might contain narrow ranges of values for the sort 
columns, making it
+      impractical to extrapolate the number of distinct values for those 
columns based on
+      sampling only some of the data files.
     </p>
 
     <p>
-      Because a sample of the table data might not contain all values for a 
particular
-      column, if the <codeph>TABLESAMPLE</codeph> is used in a join query, the
-      key relationships between the tables might produce incomplete result sets
-      compared to joins using all the table data. For example, if you join 50%
-      of table A with 50% of table B, some values in the join columns might
-      not match between the two tables, even though overall there is a 1:1
+      Because a sample of the table data might not contain all values for a 
particular column,
+      if the <codeph>TABLESAMPLE</codeph> is used in a join query, the key 
relationships between
+      the tables might produce incomplete result sets compared to joins using 
all the table
+      data. For example, if you join 50% of table A with 50% of table B, some 
values in the join
+      columns might not match between the two tables, even though overall 
there is a 1:1
       relationship between the tables.
     </p>
 
     <p>
-      The <codeph>REPEATABLE</codeph> keyword makes identical queries use a
-      consistent set of data files when the query is repeated. You specify an
-      arbitrary integer key that acts as a seed value when Impala randomly
-      selects the set of data files to use in the query. This technique
-      lets you verify correctness, examine performance, and so on for queries
-      using the <codeph>TABLESAMPLE</codeph> clause without the sampled data
-      being different each time. The repeatable aspect is reset (that is, the
-      set of selected data files may change) any time the contents of the table
-      change. The statements or operations that can make sampling results
-      non-repeatable are:
+      The <codeph>REPEATABLE</codeph> keyword makes identical queries use a 
consistent set of
+      data files when the query is repeated. You specify an arbitrary integer 
key that acts as a
+      seed value when Impala randomly selects the set of data files to use in 
the query. This
+      technique lets you verify correctness, examine performance, and so on 
for queries using
+      the <codeph>TABLESAMPLE</codeph> clause without the sampled data being 
different each
+      time. The repeatable aspect is reset (that is, the set of selected data 
files may change)
+      any time the contents of the table change. The statements or operations 
that can make
+      sampling results non-repeatable are:
     </p>
 
     <ul>
       <li>
         <codeph>INSERT</codeph>.
       </li>
+
       <li>
         <codeph>TRUNCATE TABLE</codeph>.
       </li>
+
       <li>
         <codeph>LOAD DATA</codeph>.
       </li>
+
       <li>
-        <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph>
-        after files are added or removed by a non-Impala mechanism.
-      </li>
-      <li>
+        <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> after 
files are added
+        or removed by a non-Impala mechanism.
       </li>
+
+      <li></li>
     </ul>
 
     <p>
-      This clause is similar in some ways to the <codeph>LIMIT</codeph> clause,
-      because both serve to limit the size of the intermediate data and final
-      result set. <codeph>LIMIT 0</codeph> is more efficient than
-      <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying that a query can 
execute
-      without producing any results. <codeph>TABLESAMPLE 
SYSTEM(<varname>n</varname>)</codeph>
-      often makes query processing more efficient than using a 
<codeph>LIMIT</codeph> clause
-      by itself, because all phases of query execution use less data overall.
-      If the intent is to retrieve some representative values from the table
-      in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>,
-      <codeph>ORDER BY</codeph>, and <codeph>LIMIT</codeph> clauses within a 
single query.
+      This clause is similar in some ways to the <codeph>LIMIT</codeph> 
clause, because both
+      serve to limit the size of the intermediate data and final result set. 
<codeph>LIMIT
+      0</codeph> is more efficient than <codeph>TABLESAMPLE SYSTEM(0)</codeph> 
for verifying
+      that a query can execute without producing any results. 
<codeph>TABLESAMPLE
+      SYSTEM(<varname>n</varname>)</codeph> often makes query processing more 
efficient than
+      using a <codeph>LIMIT</codeph> clause by itself, because all phases of 
query execution use
+      less data overall. If the intent is to retrieve some representative 
values from the table
+      in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>, 
<codeph>ORDER
+      BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
     </p>
 
     <p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
+
     <p>
-      When you query a partitioned table, any partition pruning happens
-      before Impala selects the data files to sample. For example, in a
-      table partitioned by year, a query with <codeph>WHERE year = 
2017</codeph>
-      and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause would sample
-      data files representing at least 10% of the bytes present in the
-      2017 partition.
+      When you query a partitioned table, any partition pruning happens before 
Impala selects
+      the data files to sample. For example, in a table partitioned by year, a 
query with
+      <codeph>WHERE year = 2017</codeph> and a <codeph>TABLESAMPLE 
SYSTEM(10)</codeph> clause
+      would sample data files representing at least 10% of the bytes present 
in the 2017
+      partition.
     </p>
 
     <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+
     <p>
-      This clause applies to S3 tables the same way as tables
-      with data files stored on HDFS.
+      This clause applies to S3 tables the same way as tables with data files 
stored on HDFS.
     </p>
 
     <p conref="../shared/impala_common.xml#common/adls_blurb"/>
+
     <p>
-      This clause applies to ADLS tables the same way as tables
-      with data files stored on HDFS.
+      This clause applies to ADLS tables the same way as tables with data 
files stored on HDFS.
     </p>
 
     <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
     <p>
       This clause does not apply to Kudu tables.
     </p>
 
     <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
+
     <p>
       This clause does not apply to HBase tables.
     </p>
 
     <p conref="../shared/impala_common.xml#common/performance_blurb"/>
+
     <p>
-      From a performance perspective, the <codeph>TABLESAMPLE</codeph>
-      clause is especially valuable for exploratory queries on
-      text, Avro, or other file formats other than Parquet. Text-based
-      or row-oriented file formats must process substantial amounts of
-      redundant data for queries that derive aggregate results such as
-      <codeph>MAX()</codeph>, <codeph>MIN()</codeph>, or <codeph>AVG()</codeph>
-      for a single column. Therefore, you might use 
<codeph>TABLESAMPLE</codeph>
-      early in the ETL pipeline, when data is still in raw text format
-      and has not been converted to Parquet or moved into a partitioned
-      table.
+      From a performance perspective, the <codeph>TABLESAMPLE</codeph> clause 
is especially
+      valuable for exploratory queries on text, Avro, or other file formats 
other than Parquet.
+      Text-based or row-oriented file formats must process substantial amounts 
of redundant data
+      for queries that derive aggregate results such as <codeph>MAX()</codeph>,
+      <codeph>MIN()</codeph>, or <codeph>AVG()</codeph> for a single column. 
Therefore, you
+      might use <codeph>TABLESAMPLE</codeph> early in the ETL pipeline, when 
data is still in
+      raw text format and has not been converted to Parquet or moved into a 
partitioned table.
     </p>
 
     <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
 
     <p>
-      This clause applies only to tables that use a storage layer
-      with underlying raw data files, such as HDFS, Amazon S3,
-      or Microsoft ADLS.
+      This clause applies only to tables that use a storage layer with 
underlying raw data
+      files, such as HDFS, Amazon S3, or Microsoft ADLS.
     </p>
 
     <p>
-      This clause does not apply to table references that represent views.
-      A query that applies the <codeph>TABLESAMPLE</codeph> clause to a
-      view or a subquery fails with a semantic error.
+      This clause does not apply to table references that represent views. A 
query that applies
+      the <codeph>TABLESAMPLE</codeph> clause to a view or a subquery fails 
with a semantic
+      error.
     </p>
 
     <p>
-      Because the sampling works at the level of entire data files, it
-      is by nature coarse-grained. It is possible to specify a small
-      sample percentage but still process a substantial portion of the
-      table data if the table contains relatively few data files, if
-      each data file is very large, or if the data files vary substantially
-      in size. Be sure that you understand the data distribution and physical
-      file layout so that you can verify if the results are suitable for
-      extrapolation. For example, if the table contains only a single data 
file,
-      the <q>sample</q> will consist of all the table data regardless of
-      the percentage you specify. If the table contains data files of
-      1 GiB, 1 GiB, and 1 KiB, when you specify a sampling percentage of
-      50 you would either process slightly more than 50% of the table
-      (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB),
-      depending on which data files were selected for sampling.
+      Because the sampling works at the level of entire data files, it is by 
nature
+      coarse-grained. It is possible to specify a small sample percentage but 
still process a
+      substantial portion of the table data if the table contains relatively 
few data files, if
+      each data file is very large, or if the data files vary substantially in 
size. Be sure
+      that you understand the data distribution and physical file layout so 
that you can verify
+      if the results are suitable for extrapolation. For example, if the table 
contains only a
+      single data file, the <q>sample</q> will consist of all the table data 
regardless of the
+      percentage you specify. If the table contains data files of 1 GiB, 1 
GiB, and 1 KiB, when
+      you specify a sampling percentage of 50 you would either process 
slightly more than 50% of
+      the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), 
depending on which
+      data files were selected for sampling.
     </p>
 
     <p>
-      If data files are added by a non-Impala mechanism, and the
-      table metadata is not updated by a <codeph>REFRESH</codeph>
-      or <codeph>INVALIDATE METADATA</codeph> statement, the
-      <codeph>TABLESAMPLE</codeph> clause does not consider those
-      new files when computing the number of bytes in the table
-      or selecting which files to sample.
+      If data files are added by a non-Impala mechanism, and the table 
metadata is not updated
+      by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> 
statement, the
+      <codeph>TABLESAMPLE</codeph> clause does not consider those new files 
when computing the
+      number of bytes in the table or selecting which files to sample.
     </p>
 
     <p>
-      If data files are removed by a non-Impala mechanism, and the
-      table metadata is not updated by a <codeph>REFRESH</codeph>
-      or <codeph>INVALIDATE METADATA</codeph> statement, the
-      query fails if the <codeph>TABLESAMPLE</codeph> clause
-      attempts to reference any of the missing files.
+      If data files are removed by a non-Impala mechanism, and the table 
metadata is not updated
+      by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> 
statement, the query
+      fails if the <codeph>TABLESAMPLE</codeph> clause attempts to reference 
any of the missing
+      files.
     </p>
 
     <p conref="../shared/impala_common.xml#common/example_blurb"/>
 
     <p>
-      The following examples demonstrate the <codeph>TABLESAMPLE</codeph> 
clause.
-      These examples intentionally use very small data sets to illustrate how
-      the number of files, size of each file, and overall size of data in the 
table
-      interact with the percentage specified in the clause.
+      The following examples demonstrate the <codeph>TABLESAMPLE</codeph> 
clause. These examples
+      intentionally use very small data sets to illustrate how the number of 
files, size of each
+      file, and overall size of data in the table interact with the percentage 
specified in the
+      clause.
     </p>
 
     <p>
-      These examples use an unpartitioned table, containing several files of 
roughly
-      the same size:
+      These examples use an unpartitioned table, containing several files of 
roughly the same
+      size:
     </p>
 
-<codeblock><![CDATA[
-create table sample_demo (x int, s string);
+<codeblock>create table sample_demo (x int, s string);
 
 insert into sample_demo values (1, 'one');
 insert into sample_demo values (2, 'two');
@@ -314,20 +302,16 @@ show table stats sample_demo;
 | #Rows | #Files | Size | Format | Location                |
 +-------+--------+------+--------+-------------------------+
 | -1    | 5      | 34B  | TEXT   | /tsample.db/sample_demo |
-+-------+--------+------+--------+-------------------------+
-</codeblock>
++-------+--------+------+--------+-------------------------+</codeblock>
 
     <p>
-      A query that samples 50% of the table must process at least
-      17 bytes of data. Based on the sizes of the data files,
-      we can predict that each such query uses 3 arbitrary files.
-      Any 1 or 2 files are not enough to reach 50% of the total
-      data in the table (34 bytes), so the query adds more files
-      until it passes the 50% threshold:
+      A query that samples 50% of the table must process at least 17 bytes of 
data. Based on the
+      sizes of the data files, we can predict that each such query uses 3 
arbitrary files. Any 1
+      or 2 files are not enough to reach 50% of the total data in the table 
(34 bytes), so the
+      query adds more files until it passes the 50% threshold:
     </p>
 
-<codeblock><![CDATA[
-select distinct x from sample_demo tablesample system(50);
+<codeblock>select distinct x from sample_demo tablesample system(50);
 +---+
 | x |
 +---+
@@ -352,19 +336,16 @@ select distinct x from sample_demo tablesample system(50);
 | 5 |
 | 3 |
 | 2 |
-+---+
-</codeblock>
++---+</codeblock>
 
     <p>
-      To help run reproducible experiments, the <codeph>REPEATABLE</codeph>
-      clause causes Impala to choose the same set of files for each query.
-      Although the data set being considered is deterministic, the order
-      of results varies (in the absence of an <codeph>ORDER BY</codeph>
+      To help run reproducible experiments, the <codeph>REPEATABLE</codeph> 
clause causes Impala
+      to choose the same set of files for each query. Although the data set 
being considered is
+      deterministic, the order of results varies (in the absence of an 
<codeph>ORDER BY</codeph>
       clause) because of the way distributed queries are processed:
     </p>
 
-<codeblock><![CDATA[
-select distinct x from sample_demo
+<codeblock>select distinct x from sample_demo
   tablesample system(50) repeatable (12345);
 +---+
 | x |
@@ -386,17 +367,13 @@ select distinct x from sample_demo
 </codeblock>
 
     <p>
-      The following examples show how uneven data distribution affects
-      which data is sampled. Adding another data file containing a long
-      string value changes the threshold for 50% of the total data in
-      the table:
+      The following examples show how uneven data distribution affects which 
data is sampled.
+      Adding another data file containing a long string value changes the 
threshold for 50% of
+      the total data in the table:
     </p>
 
-<codeblock><![CDATA[
-insert into sample_demo values (1000, 'Boyhood is the longest time in li
-fe for a boy. The last term of the school-year is made of decades, not o
-f weeks, and living through them is like waiting for the millennium. Boo
-th Tarkington');
+<codeblock>insert into sample_demo values
+(1000, 'Boyhood is the longest time in life for a boy. The last term of the 
school-year is made of decades, not of weeks, and living through them is like 
waiting for the millennium. Booth Tarkington');
 
 show files in sample_demo;
 +---------------------+------+-----------+
@@ -419,17 +396,15 @@ show table stats sample_demo;
 </codeblock>
 
     <p>
-      Even though the queries do not refer to the <codeph>S</codeph>
-      column containing the long value, all the sampling queries include
-      the data file containing the column value <codeph>X=1000</codeph>,
-      because the query cannot reach the 50% threshold (115 bytes) without
-      including that file. The large file might be considered first, in which
-      case it is the only file processed by the query. Or an arbitrary
-      set of other files might be considered first.
+      Even though the queries do not refer to the <codeph>S</codeph> column 
containing the long
+      value, all the sampling queries include the data file containing the 
column value
+      <codeph>X=1000</codeph>, because the query cannot reach the 50% 
threshold (115 bytes)
+      without including that file. The large file might be considered first, 
in which case it is
+      the only file processed by the query. Or an arbitrary set of other files 
might be
+      considered first.
     </p>
 
-<codeblock><![CDATA[
-select distinct x from sample_demo tablesample system(50);
+<codeblock>select distinct x from sample_demo tablesample system(50);
 +------+
 | x    |
 +------+
@@ -453,17 +428,14 @@ select distinct x from sample_demo tablesample system(50);
 | 4    |
 | 2    |
 | 1    |
-+------+
-</codeblock>
++------+</codeblock>
 
     <p>
-      The following examples demonstrate how the <codeph>TABLESAMPLE</codeph>
-      clause interacts with other table aspects, such as partitioning and file
-      format:
+      The following examples demonstrate how the <codeph>TABLESAMPLE</codeph> 
clause interacts
+      with other table aspects, such as partitioning and file format:
     </p>
 
-<codeblock><![CDATA[
-create table sample_demo_partitions (x int, s string) partitioned by (n int) 
stored as parquet;
+<codeblock>create table sample_demo_partitions (x int, s string) partitioned 
by (n int) stored as parquet;
 
 insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
 insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
@@ -493,12 +465,11 @@ show table stats tablesample_demo_partitioned;
 </codeblock>
 
     <p>
-      If the query does not involve any partition pruning, the
-      sampling applies to the data volume of the entire table:
+      If the query does not involve any partition pruning, the sampling 
applies to the data
+      volume of the entire table:
     </p>
 
-<codeblock><![CDATA[
--- 18 rows total.
+<codeblock>-- 18 rows total.
 select count(*) from sample_demo_partitions;
 +----------+
 | count(*) |
@@ -524,18 +495,14 @@ select count(*) from sample_demo_partitions
 | count(*) |
 +----------+
 | 16       |
-+----------+
-</codeblock>
++----------+</codeblock>
 
     <p>
-      If the query only processes certain partitions,
-      the query computes the sampling threshold based on
-      the data size and set of files only from the
-      relevant partitions:
+      If the query only processes certain partitions, the query computes the 
sampling threshold
+      based on the data size and set of files only from the relevant 
partitions:
     </p>
 
-<codeblock><![CDATA[
-select count(*) from sample_demo_partitions
+<codeblock>select count(*) from sample_demo_partitions
   tablesample system(50) where n = 1;
 +----------+
 | count(*) |
@@ -550,13 +517,14 @@ select count(*) from sample_demo_partitions
 +----------+
 | 2        |
 +----------+
-]]>
 </codeblock>
 
     <p conref="../shared/impala_common.xml#common/related_info"/>
+
     <p>
       <xref href="impala_select.xml#select"/>
     </p>
 
   </conbody>
+
 </concept>

[impala] 01/03: [DOCS] Format fixes in the TABLESAMPLE doc

Reply via email to