[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

GitBox Tue, 10 Mar 2020 01:42:03 -0700

HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390161131


 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
+ This feature choose the skewed partition and creates multi tasks to handle the skewed partition when both enable `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. There are two following sub-configurations in this optimization rule. Data skew can severely downgrade performance of join queries. And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join.
 
 Review comment:
   `choose` -> `chooses`
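
   For context, a minimal sketch of how the properties named in this hunk might be switched on for one Spark SQL session. Only the property names and the 64 MB default come from the text under review; the use of `SET` statements and the chosen values are assumptions for illustration, not something the PR itself specifies.

```sql
-- Umbrella switch; the doc text above says AQE is disabled by default.
SET spark.sql.adaptive.enabled=true;

-- Per-feature switches; per the doc text, each takes effect only together with the umbrella switch.
SET spark.sql.adaptive.coalescePartitions.enabled=true;
SET spark.sql.adaptive.localShuffleReader.enabled=true;
SET spark.sql.adaptive.skewJoin.enabled=true;

-- Illustrative value: the table in the hunk lists 67108864 (64 MB) as the default target size.
SET spark.sql.adaptive.advisoryPartitionSizeInBytes=67108864;
```

   A short example along these lines in the guide might make it clearer that each feature needs both its own flag and `spark.sql.adaptive.enabled`.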

