[03/10] impala git commit: IMPALA-6408: [DOCS] Add a missing info about SHUFFLE

tarmstrong Wed, 13 Jun 2018 08:48:45 -0700

IMPALA-6408: [DOCS] Add a missing info about SHUFFLE

Change-Id: I5738557354c384aab983f64722dde5944037aed7
Reviewed-on: http://gerrit.cloudera.org:8080/10685
Reviewed-by: Csaba Ringhofer <csringho...@cloudera.com>
Reviewed-by: Alex Rodoni <arod...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>



Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/342b2f4b
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/342b2f4b
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/342b2f4b

Branch: refs/heads/2.x
Commit: 342b2f4ba826c560f40da57a4108aaac090e81d9
Parents: ab7ac5b
Author: Alex Rodoni <arod...@cloudera.com>
Authored: Mon Jun 11 10:30:00 2018 -0700
Committer: Impala Public Jenkins <impala-public-jenk...@gerrit.cloudera.org>
Committed: Wed Jun 13 03:10:18 2018 +0000

----------------------------------------------------------------------
 docs/topics/impala_hints.xml   | 386 +++++++++++++++++++-----------------
 docs/topics/impala_parquet.xml | 193 ++++++++----------
 2 files changed, 286 insertions(+), 293 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/342b2f4b/docs/topics/impala_hints.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_hints.xml b/docs/topics/impala_hints.xml
index b936b6a..2bdcac1 100644
--- a/docs/topics/impala_hints.xml
+++ b/docs/topics/impala_hints.xml
@@ -21,7 +21,13 @@ under the License.
 <concept id="hints">
 
   <title>Optimizer Hints</title>
-  <titlealts audience="PDF"><navtitle>Optimizer Hints</navtitle></titlealts>
+
+  <titlealts audience="PDF">
+
+    <navtitle>Optimizer Hints</navtitle>
+
+  </titlealts>
+
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -37,40 +43,40 @@ under the License.
   <conbody>
 
     <p>
-      <indexterm audience="hidden">hints</indexterm> The Impala SQL supports
-      query hints, for fine-tuning the inner workings of queries. Specify hints
-      as a temporary workaround for expensive queries, where missing statistics
-      or other factors cause inefficient performance. </p>
+      <indexterm audience="hidden">hints</indexterm>
+      The Impala SQL supports query hints, for fine-tuning the inner workings 
of queries.
+      Specify hints as a temporary workaround for expensive queries, where 
missing statistics or
+      other factors cause inefficient performance.
+    </p>
 
-    <p> Hints are most often used for the resource-intensive Impala queries,
-      such as: </p>
+    <p>
+      Hints are most often used for the resource-intensive Impala queries, 
such as:
+    </p>
 
     <ul>
       <li>
-        Join queries involving large tables, where intermediate result sets 
are transmitted across the network to
-        evaluate the join conditions.
+        Join queries involving large tables, where intermediate result sets 
are transmitted
+        across the network to evaluate the join conditions.
       </li>
 
       <li>
-        Inserting into partitioned Parquet tables, where many memory buffers 
could be allocated on each host to
-        hold intermediate results for each partition.
+        Inserting into partitioned Parquet tables, where many memory buffers 
could be allocated
+        on each host to hold intermediate results for each partition.
       </li>
     </ul>
 
     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
 
-    <p rev="2.0.0"> In <keyword keyref="impala20_full"/> and higher, you can
-      specify the hints inside comments that use either the <codeph>/*
-        */</codeph> or <codeph>--</codeph> notation. Specify a
-        <codeph>+</codeph> symbol immediately before the hint name. Recently
-      added hints are only available using the <codeph>/* */</codeph> and
-        <codeph>--</codeph> notation. For clarity, the <codeph>/* */</codeph>
-      and <codeph>--</codeph> styles are used in the syntax and examples
-      throughout this section. With the <codeph>/* */</codeph> or
-        <codeph>--</codeph> notation for hints, specify a <codeph>+</codeph>
-      symbol immediately before the first hint name. Multiple hints can be
-      specified separated by commas, for example <codeph>/* +clustered,shuffle
-        */</codeph>
+    <p rev="2.0.0">
+      In <keyword keyref="impala20_full"/> and higher, you can specify the 
hints inside comments
+      that use either the <codeph>/* */</codeph> or <codeph>--</codeph> 
notation. Specify a
+      <codeph>+</codeph> symbol immediately before the hint name. Recently 
added hints are only
+      available using the <codeph>/* */</codeph> and <codeph>--</codeph> 
notation. For clarity,
+      the <codeph>/* */</codeph> and <codeph>--</codeph> styles are used in 
the syntax and
+      examples throughout this section. With the <codeph>/* */</codeph> or 
<codeph>--</codeph>
+      notation for hints, specify a <codeph>+</codeph> symbol immediately 
before the first hint
+      name. Multiple hints can be specified separated by commas, for example 
<codeph>/*
+      +clustered,shuffle */</codeph>
     </p>
 
 <codeblock rev="2.0.0">SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
@@ -159,11 +165,14 @@ CREATE -- +CLUSTER|NOCLUSTER
   <varname>table_clauses</varname>
   AS SELECT <varname>remainder_of_query</varname>;
 </codeblock>
-    <p>The square bracket style hints are supported for backward compatibility,
-      but the syntax is deprecated and will be removed in a future release. For
-      that reason, any newly added hints are not available with the square
-      bracket syntax.</p>
-    <codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
+
+    <p>
+      The square bracket style hints are supported for backward compatibility, 
but the syntax is
+      deprecated and will be removed in a future release. For that reason, any 
newly added hints
+      are not available with the square bracket syntax.
+    </p>
+
+<codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
 <varname>join_left_hand_table</varname>
   JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }]
 <varname>join_right_hand_table</varname>
@@ -184,26 +193,28 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
     <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
 
     <p>
-      With both forms of hint syntax, include the 
<codeph>STRAIGHT_JOIN</codeph>
-      keyword immediately after the <codeph>SELECT</codeph> and any
-      <codeph>DISTINCT</codeph> or <codeph>ALL</codeph> keywords to prevent 
Impala from
-      reordering the tables in a way that makes the join-related hints 
ineffective.
+      With both forms of hint syntax, include the 
<codeph>STRAIGHT_JOIN</codeph> keyword
+      immediately after the <codeph>SELECT</codeph> and any 
<codeph>DISTINCT</codeph> or
+      <codeph>ALL</codeph> keywords to prevent Impala from reordering the 
tables in a way that
+      makes the join-related hints ineffective.
     </p>
 
     <p 
conref="../shared/impala_common.xml#common/straight_join_nested_queries"/>
 
     <p>
-      To reduce the need to use hints, run the <codeph>COMPUTE STATS</codeph> 
statement against all tables involved
-      in joins, or used as the source tables for <codeph>INSERT ... 
SELECT</codeph> operations where the
-      destination is a partitioned Parquet table. Do this operation after 
loading data or making substantial
-      changes to the data within each table. Having up-to-date statistics 
helps Impala choose more efficient query
-      plans without the need for hinting. See <xref 
href="impala_perf_stats.xml#perf_stats"/> for details and
+      To reduce the need to use hints, run the <codeph>COMPUTE STATS</codeph> 
statement against
+      all tables involved in joins, or used as the source tables for 
<codeph>INSERT ...
+      SELECT</codeph> operations where the destination is a partitioned 
Parquet table. Do this
+      operation after loading data or making substantial changes to the data 
within each table.
+      Having up-to-date statistics helps Impala choose more efficient query 
plans without the
+      need for hinting. See <xref href="impala_perf_stats.xml#perf_stats"/> 
for details and
       examples.
     </p>
 
     <p>
-      To see which join strategy is used for a particular query, examine the 
<codeph>EXPLAIN</codeph> output for
-      that query. See <xref href="impala_explain_plan.xml#perf_explain"/> for 
details and examples.
+      To see which join strategy is used for a particular query, examine the
+      <codeph>EXPLAIN</codeph> output for that query. See
+      <xref href="impala_explain_plan.xml#perf_explain"/> for details and 
examples.
     </p>
 
     <p>
@@ -211,45 +222,50 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
     </p>
 
     <p>
-      The <codeph>/* +BROADCAST */</codeph> and <codeph>/* +SHUFFLE 
*/</codeph> hints control the execution strategy for join
-      queries. Specify one of the following constructs immediately after the 
<codeph>JOIN</codeph> keyword in a
-      query:
+      The <codeph>/* +BROADCAST */</codeph> and <codeph>/* +SHUFFLE 
*/</codeph> hints control
+      the execution strategy for join queries. Specify one of the following 
constructs
+      immediately after the <codeph>JOIN</codeph> keyword in a query:
     </p>
 
     <ul>
       <li>
-        <codeph>/* +SHUFFLE */</codeph> - Makes that join operation use the 
<q>partitioned</q> technique, which divides
-        up corresponding rows from both tables using a hashing algorithm, 
sending subsets of the rows to other
-        nodes for processing. (The keyword <codeph>SHUFFLE</codeph> is used to 
indicate a <q>partitioned join</q>,
-        because that type of join is not related to <q>partitioned 
tables</q>.) Since the alternative
-        <q>broadcast</q> join mechanism is the default when table and index 
statistics are unavailable, you might
-        use this hint for queries where broadcast joins are unsuitable; 
typically, partitioned joins are more
-        efficient for joins between large tables of similar size.
+        <codeph>/* +SHUFFLE */</codeph> makes that join operation use the 
<q>partitioned</q>
+        technique, which divides up corresponding rows from both tables using 
a hashing
+        algorithm, sending subsets of the rows to other nodes for processing. 
(The keyword
+        <codeph>SHUFFLE</codeph> is used to indicate a <q>partitioned 
join</q>, because that
+        type of join is not related to <q>partitioned tables</q>.) Since the 
alternative
+        <q>broadcast</q> join mechanism is the default when table and index 
statistics are
+        unavailable, you might use this hint for queries where broadcast joins 
are unsuitable;
+        typically, partitioned joins are more efficient for joins between 
large tables of
+        similar size.
       </li>
 
       <li>
-        <codeph>/* +BROADCAST */</codeph> - Makes that join operation use the 
<q>broadcast</q> technique that sends the
-        entire contents of the right-hand table to all nodes involved in 
processing the join. This is the default
-        mode of operation when table and index statistics are unavailable, so 
you would typically only need it if
-        stale metadata caused Impala to mistakenly choose a partitioned join 
operation. Typically, broadcast joins
-        are more efficient in cases where one table is much smaller than the 
other. (Put the smaller table on the
-        right side of the <codeph>JOIN</codeph> operator.)
+        <codeph>/* +BROADCAST */</codeph> makes that join operation use the 
<q>broadcast</q>
+        technique that sends the entire contents of the right-hand table to 
all nodes involved
+        in processing the join. This is the default mode of operation when 
table and index
+        statistics are unavailable, so you would typically only need it if 
stale metadata caused
+        Impala to mistakenly choose a partitioned join operation. Typically, 
broadcast joins are
+        more efficient in cases where one table is much smaller than the 
other. (Put the smaller
+        table on the right side of the <codeph>JOIN</codeph> operator.)
       </li>
     </ul>
 
     <p>
       <b>Hints for INSERT ... SELECT and CREATE TABLE AS SELECT (CTAS):</b>
     </p>
+
     <p id="insert_hints">
-      When inserting into partitioned tables, such as using the Parquet file
-      format, you can include a hint in the <codeph>INSERT</codeph> or 
<codeph>CREATE TABLE AS SELECT(CTAS)</codeph>
-      statements to fine-tune the overall performance of the operation and its
-      resource usage.</p>
+      When inserting into partitioned tables, such as using the Parquet file 
format, you can
+      include a hint in the <codeph>INSERT</codeph> or <codeph>CREATE TABLE AS
+      SELECT(CTAS)</codeph> statements to fine-tune the overall performance of 
the operation and
+      its resource usage.
+    </p>
+
     <p>
-      You would only use hints if an <codeph>INSERT</codeph> or
-        <codeph>CTAS</codeph> into a partitioned table was failing due to
-      capacity limits, or if such an operation was succeeding but with
-      less-than-optimal performance.
+      You would only use hints if an <codeph>INSERT</codeph> or 
<codeph>CTAS</codeph> into a
+      partitioned table was failing due to capacity limits, or if such an 
operation was
+      succeeding but with less-than-optimal performance.
     </p>
 
     <ul>
@@ -257,95 +273,83 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
         <codeph>/* +SHUFFLE */</codeph> and <codeph>/* +NOSHUFFLE */</codeph> 
Hints
         <ul>
           <li>
-            <codeph>/* +SHUFFLE */</codeph> adds an exchange node, before
-            writing the data, which re-partitions the result of the
-              <codeph>SELECT</codeph> based on the partitioning columns of the
-            target table. With this hint, only one node writes to a partition 
at
-            a time, minimizing the global number of simultaneous writes and the
-            number of memory buffers holding data for individual partitions.
-            This also reduces fragmentation, resulting in fewer files. Thus it
-            reduces overall resource usage of the <codeph>INSERT</codeph> or
-              <codeph>CTAS</codeph> operation and allows some operations to
-            succeed that otherwise would fail. It does involve some data
-            transfer between the nodes so that the data files for a particular
-            partition are all written on the same node.
-
+            <codeph>/* +SHUFFLE */</codeph> adds an exchange node, before 
writing the data,
+            which re-partitions the result of the <codeph>SELECT</codeph> 
based on the
+            partitioning columns of the target table. With this hint, only one 
node writes to a
+            partition at a time, minimizing the global number of simultaneous 
writes and the
+            number of memory buffers holding data for individual partitions. 
This also reduces
+            fragmentation, resulting in fewer files. Thus it reduces overall 
resource usage of
+            the <codeph>INSERT</codeph> or <codeph>CTAS</codeph> operation and 
allows some
+            operations to succeed that otherwise would fail. It does involve 
some data transfer
+            between the nodes so that the data files for a particular 
partition are all written
+            on the same node.
             <p>
-              Use <codeph>/* +SHUFFLE */</codeph> in cases where an 
<codeph>INSERT</codeph>
-              or <codeph>CTAS</codeph> statement fails or runs inefficiently 
due
-              to all nodes attempting to write data for all partitions.
+              Use <codeph>/* +SHUFFLE */</codeph> in cases where an 
<codeph>INSERT</codeph> or
+              <codeph>CTAS</codeph> statement fails or runs inefficiently due 
to all nodes
+              attempting to write data for all partitions.
             </p>
 
-            <p> If the table is unpartitioned or every partitioning expression
-              is constant, then <codeph>/* +SHUFFLE */</codeph> will cause 
every
-              write to happen on the coordinator node.
+            <p>
+              If the table is unpartitioned or every partitioning expression 
is constant, then
+              <codeph>/* +SHUFFLE */</codeph> will cause every write to happen 
on the
+              coordinator node.
             </p>
           </li>
 
           <li>
-            <codeph>/* +NOSHUFFLE */</codeph> does not add exchange node before
-            inserting to partitioned tables and disables re-partitioning. So 
the
-            selected execution plan might be faster overall, but might also
-            produce a larger number of small data files or exceed capacity
-            limits, causing the <codeph>INSERT</codeph> or 
<codeph>CTAS</codeph>
-            operation to fail.
-
-            <p> Impala automatically uses the <codeph>/*
-                +SHUFFLE */</codeph> method if any partition key column in the
-              source table, mentioned in the <codeph>SELECT</codeph> clause,
-              does not have column statistics. In this case, use the <codeph>/*
-                +NOSHUFFLE */</codeph> hint if you want to override this 
default
-              behavior.
+            <codeph>/* +NOSHUFFLE */</codeph> does not add exchange node 
before inserting to
+            partitioned tables and disables re-partitioning. So the selected 
execution plan
+            might be faster overall, but might also produce a larger number of 
small data files
+            or exceed capacity limits, causing the <codeph>INSERT</codeph> or
+            <codeph>CTAS</codeph> operation to fail.
+            <p>
+              Impala automatically uses the <codeph>/* +SHUFFLE */</codeph> 
method if any
+              partition key column in the source table, mentioned in the 
<codeph>SELECT</codeph>
+              clause, does not have column statistics. In this case, use the 
<codeph>/*
+              +NOSHUFFLE */</codeph> hint if you want to override this default 
behavior.
             </p>
           </li>
 
           <li>
-            If column statistics are available for all partition key columns
-            in the source table mentioned in the <codeph>INSERT ...
-              SELECT</codeph> or <codeph>CTAS</codeph> query, Impala chooses
-            whether to use the <codeph>/* +SHUFFLE */</codeph> or <codeph>/*
-              +NOSHUFFLE */</codeph> technique based on the estimated number of
-            distinct values in those columns and the number of nodes involved 
in
-            the operation. In this case, you might need the <codeph>/* +SHUFFLE
-              */</codeph> or the <codeph>/* +NOSHUFFLE */</codeph> hint to
-            override the execution plan selected by Impala.
+            If column statistics are available for all partition key columns 
in the source table
+            mentioned in the <codeph>INSERT ... SELECT</codeph> or 
<codeph>CTAS</codeph> query,
+            Impala chooses whether to use the <codeph>/* +SHUFFLE */</codeph> 
or <codeph>/*
+            +NOSHUFFLE */</codeph> technique based on the estimated number of 
distinct values in
+            those columns and the number of nodes involved in the operation. 
In this case, you
+            might need the <codeph>/* +SHUFFLE */</codeph> or the <codeph>/* 
+NOSHUFFLE
+            */</codeph> hint to override the execution plan selected by Impala.
           </li>
         </ul>
       </li>
 
       <li>
-        <codeph>/* +CLUSTERED */</codeph> and <codeph>/* +NOCLUSTERED
-          */</codeph> Hints
+        <codeph>/* +CLUSTERED */</codeph> and <codeph>/* +NOCLUSTERED 
*/</codeph> Hints
         <ul>
           <li>
-            <codeph>/* +CLUSTERED */</codeph> sorts data by the partition
-            columns before inserting to ensure that only one partition is
-            written at a time per node. Use this hint to reduce the number of
-            files kept open and the number of buffers kept in memory
-            simultaneously. This technique is primarily useful for inserts into
-            Parquet tables, where the large block size requires substantial
-            memory to buffer data for multiple output files at once. This hint
-            is available in <keyword keyref="impala28_full"/> or higher.
-
+            <codeph>/* +CLUSTERED */</codeph> sorts data by the partition 
columns before
+            inserting to ensure that only one partition is written at a time 
per node. Use this
+            hint to reduce the number of files kept open and the number of 
buffers kept in
+            memory simultaneously. This technique is primarily useful for 
inserts into Parquet
+            tables, where the large block size requires substantial memory to 
buffer data for
+            multiple output files at once. This hint is available in
+            <keyword keyref="impala28_full"/> or higher.
             <p>
-              Starting in <keyword keyref="impala30_full"/>, <codeph>/*
-                +CLUSTERED */</codeph> is the default behavior for HDFS tables.
+              Starting in <keyword keyref="impala30_full"/>, <codeph>/* 
+CLUSTERED */</codeph>
+              is the default behavior for HDFS tables.
             </p>
           </li>
 
           <li>
-            <codeph>/* +NOCLUSTERED */</codeph> does not sort by primary key
-            before insert. This hint is available in <keyword
+            <codeph>/* +NOCLUSTERED */</codeph> does not sort by primary key 
before insert. This
+            hint is available in <keyword
               keyref="impala28_full"/> or higher.
-
             <p>
               Use this hint when inserting to Kudu tables.
             </p>
 
             <p>
-              In the versions lower than <keyword keyref="impala30_full"/>,
-                <codeph>/* +NOCLUSTERED */</codeph> is the default in HDFS
-              tables.
+              In the versions lower than <keyword keyref="impala30_full"/>, 
<codeph>/*
+              +NOCLUSTERED */</codeph> is the default in HDFS tables.
             </p>
           </li>
         </ul>
@@ -353,14 +357,13 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
     </ul>
 
     <p>
-      Starting from <keyword keyref="impala29_full"/>, <codeph>INSERT</codeph>
-      or <codeph>UPSERT</codeph> operations into Kudu tables automatically have
-      an exchange and sort node added to the plan that partitions and sorts the
-      rows according to the partitioning/primary key scheme of the target table
-      (unless the number of rows to be inserted is small enough to trigger
-      single node execution). Use the<codeph> /* +NOCLUSTERED */</codeph> and
-        <codeph>/* +NOSHUFFLE */</codeph> hints together to disable 
partitioning
-      and sorting before the rows are sent to Kudu.
+      Starting from <keyword keyref="impala29_full"/>, <codeph>INSERT</codeph> 
or
+      <codeph>UPSERT</codeph> operations into Kudu tables automatically have 
an exchange and
+      sort node added to the plan that partitions and sorts the rows according 
to the
+      partitioning/primary key scheme of the target table (unless the number 
of rows to be
+      inserted is small enough to trigger single node execution). Use 
the<codeph> /*
+      +NOCLUSTERED */</codeph> and <codeph>/* +NOSHUFFLE */</codeph> hints 
together to disable
+      partitioning and sorting before the rows are sent to Kudu.
     </p>
 
     <p rev="IMPALA-2924">
@@ -368,30 +371,26 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
     </p>
 
     <p rev="IMPALA-2924">
-      The hints <codeph>/* +SCHEDULE_CACHE_LOCAL */</codeph>,
-      <codeph>/* +SCHEDULE_DISK_LOCAL */</codeph>, and
-      <codeph>/* +SCHEDULE_REMOTE */</codeph> have the same effect
-      as specifying the <codeph>REPLICA_PREFERENCE</codeph> query
-      option with the respective option settings of 
<codeph>CACHE_LOCAL</codeph>,
-      <codeph>DISK_LOCAL</codeph>, or <codeph>REMOTE</codeph>.
-      The hint <codeph>/* +RANDOM_REPLICA */</codeph> is the same as
+      The hints <codeph>/* +SCHEDULE_CACHE_LOCAL */</codeph>, <codeph>/* 
+SCHEDULE_DISK_LOCAL
+      */</codeph>, and <codeph>/* +SCHEDULE_REMOTE */</codeph> have the same 
effect as
+      specifying the <codeph>REPLICA_PREFERENCE</codeph> query option with the 
respective option
+      settings of <codeph>CACHE_LOCAL</codeph>, <codeph>DISK_LOCAL</codeph>, or
+      <codeph>REMOTE</codeph>. The hint <codeph>/* +RANDOM_REPLICA */</codeph> 
is the same as
       enabling the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option.
     </p>
 
     <p rev="IMPALA-2924">
-      You can use these hints in combination by separating them with commas,
-      for example, <codeph>/* +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>.
-      See <xref keyref="replica_preference"/> and
-      <xref keyref="schedule_random_replica"/> for information about how
-      these settings influence the way Impala processes HDFS data blocks.
+      You can use these hints in combination by separating them with commas, 
for example,
+      <codeph>/* +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>. See
+      <xref keyref="replica_preference"/> and <xref 
keyref="schedule_random_replica"/> for
+      information about how these settings influence the way Impala processes 
HDFS data blocks.
     </p>
 
     <p rev="IMPALA-2924">
-      Specifying the replica preference as a query hint always overrides the
-      query option setting. Specifying either the 
<codeph>SCHEDULE_RANDOM_REPLICA</codeph>
-      query option or the corresponding <codeph>RANDOM_REPLICA</codeph> query 
hint
-      enables the random tie-breaking behavior when processing data blocks
-      during the query.
+      Specifying the replica preference as a query hint always overrides the 
query option
+      setting. Specifying either the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> 
query option or
+      the corresponding <codeph>RANDOM_REPLICA</codeph> query hint enables the 
random
+      tie-breaking behavior when processing data blocks during the query.
     </p>
 
     <p>
@@ -399,65 +398,74 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
     </p>
 
     <p>
-      In early Impala releases, hints were always obeyed and so acted more 
like directives. Once Impala gained join
-      order optimizations, sometimes join queries were automatically reordered 
in a way that made a hint
-      irrelevant. Therefore, the hints act more like suggestions in Impala 
1.2.2 and higher.
+      In early Impala releases, hints were always obeyed and so acted more 
like directives. Once
+      Impala gained join order optimizations, sometimes join queries were 
automatically
+      reordered in a way that made a hint irrelevant. Therefore, the hints act 
more like
+      suggestions in Impala 1.2.2 and higher.
     </p>
 
     <p>
       To force Impala to follow the hinted execution mechanism for a join 
query, include the
       <codeph>STRAIGHT_JOIN</codeph> keyword in the <codeph>SELECT</codeph> 
statement. See
-      <xref href="impala_perf_joins.xml#straight_join"/> for details. When you 
use this technique, Impala does not
-      reorder the joined tables at all, so you must be careful to arrange the 
join order to put the largest table
-      (or subquery result set) first, then the smallest, second smallest, 
third smallest, and so on. This ordering lets Impala do the
-      most I/O-intensive parts of the query using local reads on the 
DataNodes, and then reduce the size of the
-      intermediate result set as much as possible as each subsequent table or 
subquery result set is joined.
+      <xref href="impala_perf_joins.xml#straight_join"/> for details. When you 
use this
+      technique, Impala does not reorder the joined tables at all, so you must 
be careful to
+      arrange the join order to put the largest table (or subquery result set) 
first, then the
+      smallest, second smallest, third smallest, and so on. This ordering lets 
Impala do the
+      most I/O-intensive parts of the query using local reads on the 
DataNodes, and then reduce
+      the size of the intermediate result set as much as possible as each 
subsequent table or
+      subquery result set is joined.
     </p>
 
     <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
 
     <p>
-      Queries that include subqueries in the <codeph>WHERE</codeph> clause can 
be rewritten internally as join
-      queries. Currently, you cannot apply hints to the joins produced by 
these types of queries.
+      Queries that include subqueries in the <codeph>WHERE</codeph> clause can 
be rewritten
+      internally as join queries. Currently, you cannot apply hints to the 
joins produced by
+      these types of queries.
     </p>
 
     <p>
-      Because hints can prevent queries from taking advantage of new metadata 
or improvements in query planning,
-      use them only when required to work around performance issues, and be 
prepared to remove them when they are
-      no longer required, such as after a new Impala release or bug fix.
+      Because hints can prevent queries from taking advantage of new metadata 
or improvements in
+      query planning, use them only when required to work around performance 
issues, and be
+      prepared to remove them when they are no longer required, such as after 
a new Impala
+      release or bug fix.
     </p>
 
     <p>
-      In particular, the <codeph>/* +BROADCAST */</codeph> and <codeph>/* 
+SHUFFLE */</codeph> hints are expected to be
-      needed much less frequently in Impala 1.2.2 and higher, because the join 
order optimization feature in
-      combination with the <codeph>COMPUTE STATS</codeph> statement now 
automatically choose join order and join
-      mechanism without the need to rewrite the query and add hints. See
-      <xref href="impala_perf_joins.xml#perf_joins"/> for details.
+      In particular, the <codeph>/* +BROADCAST */</codeph> and <codeph>/* 
+SHUFFLE */</codeph>
+      hints are expected to be needed much less frequently in Impala 1.2.2 and 
higher, because
+      the join order optimization feature in combination with the 
<codeph>COMPUTE STATS</codeph>
+      statement now automatically choose join order and join mechanism without 
the need to
+      rewrite the query and add hints. See <xref 
href="impala_perf_joins.xml#perf_joins"/> for
+      details.
     </p>
 
     <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
 
     <p rev="2.0.0">
-      The hints embedded within <codeph>--</codeph> comments are compatible 
with Hive queries. The hints embedded
-      within <codeph>/* */</codeph> comments or <codeph>[ ]</codeph> square 
brackets are not recognized by or not
-      compatible with Hive. For example, Hive raises an error for Impala hints 
within <codeph>/* */</codeph>
-      comments because it does not recognize the Impala hint names.
+      The hints embedded within <codeph>--</codeph> comments are compatible 
with Hive queries.
+      The hints embedded within <codeph>/* */</codeph> comments or <codeph>[ 
]</codeph> square
+      brackets are not recognized by or not compatible with Hive. For example, 
Hive raises an
+      error for Impala hints within <codeph>/* */</codeph> comments because it 
does not
+      recognize the Impala hint names.
     </p>
 
     <p conref="../shared/impala_common.xml#common/view_blurb"/>
 
     <p rev="2.0.0">
-      If you use a hint in the query that defines a view, the hint is 
preserved when you query the view. Impala
-      internally rewrites all hints in views to use the <codeph>--</codeph> 
comment notation, so that Hive can
-      query such views without errors due to unrecognized hint names.
+      If you use a hint in the query that defines a view, the hint is 
preserved when you query
+      the view. Impala internally rewrites all hints in views to use the 
<codeph>--</codeph>
+      comment notation, so that Hive can query such views without errors due 
to unrecognized
+      hint names.
     </p>
 
     <p conref="../shared/impala_common.xml#common/example_blurb"/>
 
     <p>
-      For example, this query joins a large customer table with a small lookup 
table of less than 100 rows. The
-      right-hand table can be broadcast efficiently to all nodes involved in 
the join. Thus, you would use the
-      <codeph>/* +broadcast */</codeph> hint to force a broadcast join 
strategy:
+      For example, this query joins a large customer table with a small lookup 
table of less
+      than 100 rows. The right-hand table can be broadcast efficiently to all 
nodes involved in
+      the join. Thus, you would use the <codeph>/* +broadcast */</codeph> hint 
to force a
+      broadcast join strategy:
     </p>
 
 <codeblock>select straight_join customer.address, state_lookup.state_name
@@ -465,9 +473,10 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
   on customer.state_id = state_lookup.state_id;</codeblock>
 
     <p>
-      This query joins two large tables of unpredictable size. You might 
benchmark the query with both kinds of
-      hints and find that it is more efficient to transmit portions of each 
table to other nodes for processing.
-      Thus, you would use the <codeph>/* +shuffle */</codeph> hint to force a 
partitioned join strategy:
+      This query joins two large tables of unpredictable size. You might 
benchmark the query
+      with both kinds of hints and find that it is more efficient to transmit 
portions of each
+      table to other nodes for processing. Thus, you would use the <codeph>/* 
+shuffle
+      */</codeph> hint to force a partitioned join strategy:
     </p>
 
 <codeblock>select straight_join weather.wind_velocity, geospatial.altitude
@@ -475,24 +484,27 @@ UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
   on weather.lat = geospatial.lat and weather.long = 
geospatial.long;</codeblock>
 
     <p>
-      For joins involving three or more tables, the hint applies to the tables 
on either side of that specific
-      <codeph>JOIN</codeph> keyword. The <codeph>STRAIGHT_JOIN</codeph> 
keyword ensures that joins are processed
-      in a predictable order from left to right. For example, this query joins
-      <codeph>t1</codeph> and <codeph>t2</codeph> using a partitioned join, 
then joins that result set to
-      <codeph>t3</codeph> using a broadcast join:
+      For joins involving three or more tables, the hint applies to the tables 
on either side of
+      that specific <codeph>JOIN</codeph> keyword. The 
<codeph>STRAIGHT_JOIN</codeph> keyword
+      ensures that joins are processed in a predictable order from left to 
right. For example,
+      this query joins <codeph>t1</codeph> and <codeph>t2</codeph> using a 
partitioned join,
+      then joins that result set to <codeph>t3</codeph> using a broadcast join:
     </p>
 
 <codeblock>select straight_join t1.name, t2.id, t3.price
   from t1 join <b>/* +shuffle */</b> t2 join <b>/* +broadcast */</b> t3
   on t1.id = t2.id and t2.id = t3.id;</codeblock>
 
-    <!-- To do: This is a good place to add more sample output showing before 
and after EXPLAIN plans. -->
+<!-- To do: This is a good place to add more sample output showing before and 
after EXPLAIN plans. -->
 
     <p conref="../shared/impala_common.xml#common/related_info"/>
 
     <p>
-      For more background information about join queries, see <xref 
href="impala_joins.xml#joins"/>. For
-      performance considerations, see <xref 
href="impala_perf_joins.xml#perf_joins"/>.
+      For more background information about join queries, see
+      <xref href="impala_joins.xml#joins"/>. For performance considerations, 
see
+      <xref href="impala_perf_joins.xml#perf_joins"/>.
     </p>
+
   </conbody>
+
 </concept>

http://git-wip-us.apache.org/repos/asf/impala/blob/342b2f4b/docs/topics/impala_parquet.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet.xml b/docs/topics/impala_parquet.xml
index af6ab73..3667560 100644
--- a/docs/topics/impala_parquet.xml
+++ b/docs/topics/impala_parquet.xml
@@ -188,118 +188,99 @@ This <codeph>PARQUET</codeph> keyword is recommended for 
new code.
   </prolog>
 
     <conbody>
-
-      <p>
-        Choose from the following techniques for loading data into Parquet 
tables, depending on whether the
-        original data is already in an Impala table, or exists as raw data 
files outside Impala.
-      </p>
-
-      <p>
-        If you already have data in an Impala or Hive table, perhaps in a 
different file format or partitioning
-        scheme, you can transfer the data to a Parquet table using the Impala 
<codeph>INSERT...SELECT</codeph>
-        syntax. You can convert, filter, repartition, and do other things to 
the data as part of this same
-        <codeph>INSERT</codeph> statement. See <xref 
href="#parquet_compression"/> for some examples showing how to
-        insert data into Parquet tables.
-      </p>
-
-      <p conref="../shared/impala_common.xml#common/insert_hints"/>
-
+      <p> Choose from the following techniques for loading data into Parquet
+        tables, depending on whether the original data is already in an Impala
+        table, or exists as raw data files outside Impala. </p>
+      <p> If you already have data in an Impala or Hive table, perhaps in a
+        different file format or partitioning scheme, you can transfer the data
+        to a Parquet table using the Impala <codeph>INSERT...SELECT</codeph>
+        syntax. You can convert, filter, repartition, and do other things to 
the
+        data as part of this same <codeph>INSERT</codeph> statement. See <xref
+          href="#parquet_compression"/> for some examples showing how to insert
+        data into Parquet tables. </p>
+      <p>When inserting into partitioned tables, especially using the Parquet
+        file format, you can include a hint in the <codeph>INSERT</codeph>
+        statement to fine-tune the overall performance of the operation and its
+        resource usage. See <keyword keyref="hints"/> for using hints in the
+          <codeph>INSERT</codeph> statements.</p>
       <p conref="../shared/impala_common.xml#common/insert_parquet_blocksize"/>
-
-      <draft-comment translate="no">
-Add an example here.
-</draft-comment>
-
-      <p>
-        Avoid the <codeph>INSERT...VALUES</codeph> syntax for Parquet tables, 
because
-        <codeph>INSERT...VALUES</codeph> produces a separate tiny data file 
for each
-        <codeph>INSERT...VALUES</codeph> statement, and the strength of 
Parquet is in its handling of data
-        (compressing, parallelizing, and so on) in <ph 
rev="parquet_block_size">large</ph> chunks.
-      </p>
-
-      <p>
-        If you have one or more Parquet data files produced outside of Impala, 
you can quickly make the data
-        queryable through Impala by one of the following methods:
-      </p>
-
+      <draft-comment translate="no"> Add an example here. </draft-comment>
+      <p> Avoid the <codeph>INSERT...VALUES</codeph> syntax for Parquet tables,
+        because <codeph>INSERT...VALUES</codeph> produces a separate tiny data
+        file for each <codeph>INSERT...VALUES</codeph> statement, and the
+        strength of Parquet is in its handling of data (compressing,
+        parallelizing, and so on) in <ph rev="parquet_block_size">large</ph>
+        chunks. </p>
+      <p> If you have one or more Parquet data files produced outside of 
Impala,
+        you can quickly make the data queryable through Impala by one of the
+        following methods: </p>
       <ul>
-        <li>
-          The <codeph>LOAD DATA</codeph> statement moves a single data file or 
a directory full of data files into
-          the data directory for an Impala table. It does no validation or 
conversion of the data. The original
-          data files must be somewhere in HDFS, not the local filesystem.
-          <draft-comment translate="no">
-Add an example here.
-</draft-comment>
+        <li> The <codeph>LOAD DATA</codeph> statement moves a single data file
+          or a directory full of data files into the data directory for an
+          Impala table. It does no validation or conversion of the data. The
+          original data files must be somewhere in HDFS, not the local
+          filesystem. <draft-comment translate="no"> Add an example here.
+          </draft-comment>
         </li>
-
-        <li>
-          The <codeph>CREATE TABLE</codeph> statement with the 
<codeph>LOCATION</codeph> clause creates a table
-          where the data continues to reside outside the Impala data 
directory. The original data files must be
-          somewhere in HDFS, not the local filesystem. For extra safety, if 
the data is intended to be long-lived
-          and reused by other applications, you can use the <codeph>CREATE 
EXTERNAL TABLE</codeph> syntax so that
-          the data files are not deleted by an Impala <codeph>DROP 
TABLE</codeph> statement.
-          <draft-comment translate="no">
-Add an example here.
-</draft-comment>
-        </li>
-
-        <li>
-          If the Parquet table already exists, you can copy Parquet data files 
directly into it, then use the
-          <codeph>REFRESH</codeph> statement to make Impala recognize the 
newly added data. Remember to preserve
-          the block size of the Parquet data files by using the <codeph>hadoop 
distcp -pb</codeph> command rather
-          than a <codeph>-put</codeph> or <codeph>-cp</codeph> operation on 
the Parquet files. See
-          <xref href="#parquet_compression_multiple"/> for an example of this 
kind of operation.
+        <li> The <codeph>CREATE TABLE</codeph> statement with the
+            <codeph>LOCATION</codeph> clause creates a table where the data
+          continues to reside outside the Impala data directory. The original
+          data files must be somewhere in HDFS, not the local filesystem. For
+          extra safety, if the data is intended to be long-lived and reused by
+          other applications, you can use the <codeph>CREATE EXTERNAL
+            TABLE</codeph> syntax so that the data files are not deleted by an
+          Impala <codeph>DROP TABLE</codeph> statement. <draft-comment
+            translate="no"> Add an example here. </draft-comment>
         </li>
+        <li> If the Parquet table already exists, you can copy Parquet data
+          files directly into it, then use the <codeph>REFRESH</codeph>
+          statement to make Impala recognize the newly added data. Remember to
+          preserve the block size of the Parquet data files by using the
+            <codeph>hadoop distcp -pb</codeph> command rather than a
+            <codeph>-put</codeph> or <codeph>-cp</codeph> operation on the
+          Parquet files. See <xref href="#parquet_compression_multiple"/> for 
an
+          example of this kind of operation. </li>
       </ul>
-
-      <note 
conref="../shared/impala_common.xml#common/restrictions_nonimpala_parquet"/>
-
-      <p>
-        Recent versions of Sqoop can produce Parquet output files using the 
<codeph>--as-parquetfile</codeph>
-        option.
-      </p>
-
+      <note
+        
conref="../shared/impala_common.xml#common/restrictions_nonimpala_parquet"/>
+      <p> Recent versions of Sqoop can produce Parquet output files using the
+          <codeph>--as-parquetfile</codeph> option. </p>
       <p conref="../shared/impala_common.xml#common/sqoop_timestamp_caveat"/>
-
-      <p>
-        If the data exists outside Impala and is in some other format, combine 
both of the preceding techniques.
-        First, use a <codeph>LOAD DATA</codeph> or <codeph>CREATE EXTERNAL 
TABLE ... LOCATION</codeph> statement to
-        bring the data into an Impala table that uses the appropriate file 
format. Then, use an
-        <codeph>INSERT...SELECT</codeph> statement to copy the data to the 
Parquet table, converting to Parquet
-        format as part of the process.
-      </p>
-
-      <draft-comment translate="no">
-Add an example here.
-</draft-comment>
-
-      <p>
-        Loading data into Parquet tables is a memory-intensive operation, 
because the incoming data is buffered
-        until it reaches <ph rev="parquet_block_size">one data block</ph> in 
size, then that chunk of data is
-        organized and compressed in memory before being written out. The 
memory consumption can be larger when
-        inserting data into partitioned Parquet tables, because a separate 
data file is written for each
-        combination of partition key column values, potentially requiring 
several
-        <ph rev="parquet_block_size">large</ph> chunks to be manipulated in 
memory at once.
-      </p>
-
-      <p>
-        When inserting into a partitioned Parquet table, Impala redistributes 
the data among the nodes to reduce
-        memory consumption. You might still need to temporarily increase the 
memory dedicated to Impala during the
-        insert operation, or break up the load operation into several 
<codeph>INSERT</codeph> statements, or both.
-      </p>
-
-      <note>
-        All the preceding techniques assume that the data you are loading 
matches the structure of the destination
-        table, including column order, column names, and partition layout. To 
transform or reorganize the data,
-        start by loading the data into a Parquet table that matches the 
underlying structure of the data, then use
-        one of the table-copying techniques such as <codeph>CREATE TABLE AS 
SELECT</codeph> or <codeph>INSERT ...
-        SELECT</codeph> to reorder or rename columns, divide the data among 
multiple partitions, and so on. For
-        example to take a single comprehensive Parquet data file and load it 
into a partitioned table, you would
-        use an <codeph>INSERT ... SELECT</codeph> statement with dynamic 
partitioning to let Impala create separate
-        data files with the appropriate partition values; for an example, see
-        <xref href="impala_insert.xml#insert"/>.
-      </note>
-
+      <p> If the data exists outside Impala and is in some other format, 
combine
+        both of the preceding techniques. First, use a <codeph>LOAD
+          DATA</codeph> or <codeph>CREATE EXTERNAL TABLE ... LOCATION</codeph>
+        statement to bring the data into an Impala table that uses the
+        appropriate file format. Then, use an <codeph>INSERT...SELECT</codeph>
+        statement to copy the data to the Parquet table, converting to Parquet
+        format as part of the process. </p>
+      <draft-comment translate="no"> Add an example here. </draft-comment>
+      <p> Loading data into Parquet tables is a memory-intensive operation,
+        because the incoming data is buffered until it reaches <ph
+          rev="parquet_block_size">one data block</ph> in size, then that chunk
+        of data is organized and compressed in memory before being written out.
+        The memory consumption can be larger when inserting data into
+        partitioned Parquet tables, because a separate data file is written for
+        each combination of partition key column values, potentially requiring
+        several <ph rev="parquet_block_size">large</ph> chunks to be 
manipulated
+        in memory at once. </p>
+      <p> When inserting into a partitioned Parquet table, Impala redistributes
+        the data among the nodes to reduce memory consumption. You might still
+        need to temporarily increase the memory dedicated to Impala during the
+        insert operation, or break up the load operation into several
+          <codeph>INSERT</codeph> statements, or both. </p>
+      <note> All the preceding techniques assume that the data you are loading
+        matches the structure of the destination table, including column order,
+        column names, and partition layout. To transform or reorganize the 
data,
+        start by loading the data into a Parquet table that matches the
+        underlying structure of the data, then use one of the table-copying
+        techniques such as <codeph>CREATE TABLE AS SELECT</codeph> or
+          <codeph>INSERT ... SELECT</codeph> to reorder or rename columns,
+        divide the data among multiple partitions, and so on. For example to
+        take a single comprehensive Parquet data file and load it into a
+        partitioned table, you would use an <codeph>INSERT ... SELECT</codeph>
+        statement with dynamic partitioning to let Impala create separate data
+        files with the appropriate partition values; for an example, see <xref
+          href="impala_insert.xml#insert"/>. </note>
     </conbody>
 
   </concept>

[03/10] impala git commit: IMPALA-6408: [DOCS] Add a missing info about SHUFFLE

Reply via email to