[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

GitBox Mon, 16 Mar 2020 00:48:31 -0700

URL: https://github.com/apache/spark/pull/27616#discussion_r392834386


 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether to turn it on or off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. This optimization rule has the four sub-configurations listed below. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via the `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>Default Parallelism</td>
+     <td>
+       The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The initial number of shuffle partitions before coalescing. By default it equals <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>64 MB</td>
+     <td>
+       The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions.
+     </td>
+   </tr>
+ </table>
+ 
+### Optimize Local Shuffle Reader
+AQE converts the sort merge join to broad cast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This feature can optimize the shuffle reader to local shuffle reader after converting the sort merge join to broadcast hash join at runtime and if no additional shuffle is introduced. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are enabled. This feature can improve the performance by saving the network overhead of shuffle process.
 
 Review comment:
   `broad cast` -> `broadcast`
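
For readers of the quoted guide text, the idea of coalescing contiguous shuffle partitions toward an advisory size can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's implementation: the helper name `coalesce_partitions`, the greedy grouping rule, and the sample sizes are all hypothetical, and the 64 MB target simply mirrors the default listed in the table for `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
from typing import List


def coalesce_partitions(sizes: List[int], advisory_size: int) -> List[List[int]]:
    """Hypothetical helper: greedily group contiguous partition indices.

    A group is closed as soon as adding the next partition would push its
    total size above advisory_size. A single partition that is already
    larger than advisory_size is kept as its own group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if the next partition would overshoot.
        if current and current_size + size > advisory_size:
            groups.append(current)
            current = []
            current_size = 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
# Made-up post-shuffle partition sizes, in bytes.
sizes = [10 * MB, 20 * MB, 30 * MB, 70 * MB, 5 * MB, 5 * MB]
groups = coalesce_partitions(sizes, 64 * MB)
print(groups)
```

With these sample sizes, the first three small partitions end up in one group, the oversized partition stays alone, and the last two are merged, so six shuffle partitions become three tasks. Only neighbouring partitions are merged, which matches the "contiguous" wording in the table.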
