cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r381286065
##########
File path: docs/sql-performance-tuning.md
##########
@@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
SELECT /*+ REPARTITION(3, c) */ * FROM t
SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of runtime statistics to choose the most efficient query execution plan. AQE is disabled by default; Spark SQL uses the umbrella configuration `spark.sql.adaptive.enabled` to control whether to turn it on or off. As of Spark 3.0, there are three major features in AQE: coalescing the post-shuffle partition number, optimizing the local shuffle reader, and optimizing skewed joins.
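+
+For example, the umbrella switch can be flipped per session (a minimal config fragment using the property named above):
+
+    SET spark.sql.adaptive.enabled=true;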
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post-shuffle partitions based on the map output statistics when the `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are both enabled. The following four sub-configurations control this optimization rule. This feature can bring about a 1.28x performance gain with query 38 in 3TB TPC-DS.
+  <table class="table">
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code></td>
+    <td>true</td>
+    <td>
+      When true and <code>spark.sql.adaptive.enabled</code> is enabled, Spark will reduce the number of post-shuffle partitions based on the map output statistics.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.adaptive.shuffle.minNumPostShufflePartitions</code></td>
+    <td>1</td>
+    <td>
+      The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled. It is recommended to set this to about 2 to 3 times the parallelism when benchmarking.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.adaptive.shuffle.maxNumPostShufflePartitions</code></td>
+    <td>Int.MaxValue</td>
+    <td>
+      The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals <code>spark.sql.shuffle.partitions</code>.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.adaptive.shuffle.targetPostShuffleInputSize</code></td>
+    <td>67108864 (64 MB)</td>
+    <td>
+      The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled.
+    </td>
+  </tr>
+  </table>
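+
+The coalescing rule can be pictured with a small sketch (plain Python with a hypothetical helper, not Spark's actual implementation): adjacent post-shuffle partitions, sized from the map output statistics, are merged until the accumulated input size would exceed the advisory target from `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`.

```python
# Simplified sketch of post-shuffle partition coalescing.
# NOT Spark's implementation -- an illustration of the idea only.

def coalesce_partitions(partition_sizes, target_size=64 * 1024 * 1024):
    """Group adjacent post-shuffle partitions so that each group's
    total input size stays at or below target_size (a single oversized
    partition still forms its own group)."""
    groups = []
    current, current_size = [], 0
    for idx, size in enumerate(partition_sizes):
        # Start a new group once adding this partition would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Five 20 MB map outputs against a 64 MB target collapse into two groups.
mb = 1024 * 1024
print(coalesce_partitions([20 * mb] * 5))  # [[0, 1, 2], [3, 4]]
```

With fewer, larger reducer partitions, each reduce task processes a reasonable amount of data instead of many tiny tasks each fetching a sliver of shuffle output.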
+
+### Optimize Local Shuffle Reader
+This feature optimizes the shuffle reader to a local shuffle reader when converting a sort merge join to a broadcast hash join at runtime, provided no additional shuffle is introduced. It takes effect when the `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.localShuffleReader.enabled` configuration properties are both enabled. Together with the coalescing post-shuffle partition number feature, it can bring about a 1.76x performance gain with query 77 in 3TB TPC-DS.
Review comment:
ditto, don't put perf number in a user guide. Just briefly explain how it
affects user queries. E.g. save network traffic
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]