cloud-fan commented on a change in pull request #27616: [SPARK-30864][SQL] Add
the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380471798
##########
File path: docs/sql-performance-tuning.md
##########
@@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
SELECT /*+ REPARTITION(3, c) */ * FROM t
SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of runtime statistics to choose the most efficient query execution plan. AQE is disabled by default; the umbrella configuration `spark.sql.adaptive.enabled` controls whether it is turned on or off. AQE has three main features: coalescing post-shuffle partitions, optimizing local shuffle readers, and optimizing skewed joins.
+### Coalescing Post Shuffle Partitions
+This feature coalesces the post-shuffle partitions based on the map output statistics when both the `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are enabled. The following four configurations are used by this optimization rule.
+ <table class="table">
+ <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+ <tr>
+    <td><code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code></td>
+ <td>true</td>
+ <td>
+      When true and <code>spark.sql.adaptive.enabled</code> is enabled, Spark will reduce the number of post-shuffle partitions based on the map output statistics.
+ </td>
+ </tr>
+ <tr>
+    <td><code>spark.sql.adaptive.shuffle.minNumPostShufflePartitions</code></td>
+ <td>1</td>
+ <td>
+      The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled. When benchmarking, it is suggested to set this to roughly 2-3x the parallelism.
+ </td>
+ </tr>
+ <tr>
+    <td><code>spark.sql.adaptive.shuffle.maxNumPostShufflePartitions</code></td>
+ <td>Int.MaxValue</td>
+ <td>
+      The advisory maximum number of post-shuffle partitions used in adaptive execution. This is also used as the initial number of pre-shuffle partitions; by default it equals <code>spark.sql.shuffle.partitions</code>.
+ </td>
+ </tr>
+ <tr>
+    <td><code>spark.sql.adaptive.shuffle.targetPostShuffleInputSize</code></td>
+ <td>67108864 (64 MB)</td>
+ <td>
+      The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled.
+ </td>
+ </tr>
+ </table>
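As a sketch of how these knobs fit together (the numeric values below are illustrative examples, not recommendations from the guide), the coalescing behavior could be tuned per session like this:

```sql
-- Turn on AQE and post-shuffle partition coalescing
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled=true;
-- Aim for roughly 64 MB of input per post-shuffle task (illustrative value)
SET spark.sql.adaptive.shuffle.targetPostShuffleInputSize=67108864;
-- Never coalesce below 8 partitions (illustrative value)
SET spark.sql.adaptive.shuffle.minNumPostShufflePartitions=8;
```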
+
+### Optimize Local Shuffle Reader
+This feature optimizes the shuffle reader to a local shuffle reader when converting a sort-merge join to a broadcast hash join at runtime, provided no additional shuffle is introduced. Reading shuffle blocks locally avoids the network fetch of a regular shuffle read. It takes effect when both the `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.localShuffleReader.enabled` configuration properties are enabled.
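For completeness, a minimal configuration sketch enabling this optimization for a session, using the two property names documented above:

```sql
-- Enable AQE together with the local shuffle reader optimization
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.shuffle.localShuffleReader.enabled=true;
```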
Review comment:
ditto, users care more about the benefit
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]