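[GitHub] [spark] maropu commented on a change in pull request #28672: [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference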

GitBox Fri, 29 May 2020 00:48:49 -0700


maropu commented on a change in pull request #28672:
URL: https://github.com/apache/spark/pull/28672#discussion_r432308677




##########
File path: docs/sql-ref-syntax-qry-select-hints.md
##########
@@ -21,14 +21,69 @@ license: |
 
 ### Description
 
-Join Hints allow users to suggest the join strategy that Spark should use. 
Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, 
`SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Joint Hints support was added in 3.0. 
When different join strategy hints are specified on both sides of a join, Spark 
prioritizes hints in the following order: `BROADCAST` over `MERGE` over 
`SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with 
the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side 
based on the join type and the sizes of the relations. Since a given strategy 
may not support all join types, Spark is not guaranteed to use the join 
strategy suggested by the hint.
+Hints give users a way to suggest how Spark SQL to use specific approaches to 
generate its execution plan.
 
 ### Syntax
 
 ```sql
-/*+ join_hint [ , ... ] */
+/*+ hint [ , ... ] */
 ```
 
+### Partitioning Hints
+
+`COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities 
equivalent to those of the
+`Dataset` `coalesce`/`repartition`/`repartitionByRange` APIs. The `COALESCE` 
hint can be used to reduce
+the number of partitions to the specified number of partitions. The 
`REPARTITION`/`REPARTITION_BY_RANGE`
+hint can be used to repartition to the specified number of partitions using 
the specified partitioning expressions.
+The `COALESCE` hint takes a partition number as a
+parameter. The `REPARTITION` hint takes a partition number, column names, or 
both as parameters.
+The `REPARTITION_BY_RANGE` hint takes column names and an optional partition 
number as parameters.
+These hints give users a way to tune performance and control the number of 
output files in Spark SQL.
+
+### Examples
+```sql
+SELECT /*+ COALESCE(3) */ * FROM t;
+
+EXPLAIN SELECT /*+ COALESCE(3) */ * FROM t;
+== Physical Plan ==
+Coalesce 3
++- *(1) ColumnarToRow
+   +- FileScan parquet default.t[name#5,c#6] Batched: true, DataFilters: [], 
Format: Parquet,
+      Location: CatalogFileIndex[file:/spark/spark-warehouse/t], 
PartitionFilters: [],
+      PushedFilters: [], ReadSchema: struct<name:string>
+
+SELECT /*+ REPARTITION(3) */ * FROM t;
+
+SELECT /*+ REPARTITION(c) */ * FROM t;
+
+SELECT /*+ REPARTITION(3, c) */ * FROM t;
+
+EXPLAIN SELECT /*+ REPARTITION(3, c) */ * FROM t;
+== Physical Plan ==
+Exchange hashpartitioning(c#6, 3), false, [id=#148]
++- *(1) ColumnarToRow
+   +- FileScan parquet default.t[name#5,c#6] Batched: true, DataFilters: [], 
Format: Parquet,
+      Location: CatalogFileIndex[file:/spark/spark-warehouse/t], 
PartitionFilters: [],
+      PushedFilters: [], ReadSchema: struct<name:string>
+
+SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;
+
+SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
+
+EXPLAIN SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
+== Physical Plan ==
+Exchange rangepartitioning(c#6 ASC NULLS FIRST, 3), false, [id=#167]
++- *(1) ColumnarToRow
+   +- FileScan parquet default.t[name#5,c#6] Batched: true, DataFilters: [], 
Format: Parquet,
+      Location: CatalogFileIndex[file:/spark/spark-warehouse/t], 
PartitionFilters: [],
+      PushedFilters: [], ReadSchema: struct<name:string>
+```
+
+
+### Join Hints
+
+Join Hints allow users to suggest the join strategy that Spark should use. 
Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, 
`SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Joint Hints support was added in 3.0. 
When different join strategy hints are specified on both sides of a join, Spark 
prioritizes hints in the following order: `BROADCAST` over `MERGE` over 
`SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with 
the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side 
based on the join type and the sizes of the relations. Since a given strategy 
may not support all join types, Spark is not guaranteed to use the join 
strategy suggested by the hint.

Review comment:
       `Hints` -> `hints`?

##########
File path: docs/sql-ref-syntax-qry-select-hints.md
##########
@@ -21,14 +21,69 @@ license: |
 
 ### Description
 
-Join Hints allow users to suggest the join strategy that Spark should use. 
Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, 
`SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Joint Hints support was added in 3.0. 
When different join strategy hints are specified on both sides of a join, Spark 
prioritizes hints in the following order: `BROADCAST` over `MERGE` over 
`SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with 
the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side 
based on the join type and the sizes of the relations. Since a given strategy 
may not support all join types, Spark is not guaranteed to use the join 
strategy suggested by the hint.
+Hints give users a way to suggest how Spark SQL to use specific approaches to 
generate its execution plan.
 
 ### Syntax
 
 ```sql
-/*+ join_hint [ , ... ] */
+/*+ hint [ , ... ] */
 ```
 
+### Partitioning Hints
+
+`COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities 
equivalent to those of the
+`Dataset` `coalesce`/`repartition`/`repartitionByRange` APIs. The `COALESCE` 
hint can be used to reduce
+the number of partitions to the specified number of partitions. The 
`REPARTITION`/`REPARTITION_BY_RANGE`
+hint can be used to repartition to the specified number of partitions using 
the specified partitioning expressions.
+The `COALESCE` hint takes a partition number as a
+parameter. The `REPARTITION` hint takes a partition number, column names, or 
both as parameters.
+The `REPARTITION_BY_RANGE` hint takes column names and an optional partition 
number as parameters.
+These hints give users a way to tune performance and control the number of 
output files in Spark SQL.
+
+### Examples
+```sql
+SELECT /*+ COALESCE(3) */ * FROM t;
+
+EXPLAIN SELECT /*+ COALESCE(3) */ * FROM t;
+== Physical Plan ==
+Coalesce 3
++- *(1) ColumnarToRow
+   +- FileScan parquet default.t[name#5,c#6] Batched: true, DataFilters: [], 
Format: Parquet,
+      Location: CatalogFileIndex[file:/spark/spark-warehouse/t], 
PartitionFilters: [],
+      PushedFilters: [], ReadSchema: struct<name:string>
+
+SELECT /*+ REPARTITION(3) */ * FROM t;

Review comment:
       Do we still need these statements, which produce no output, as examples?

##########
File path: docs/sql-ref-syntax-qry-select-hints.md
##########
@@ -21,14 +21,69 @@ license: |
 
 ### Description
 
-Join Hints allow users to suggest the join strategy that Spark should use. 
Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, 
`SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Joint Hints support was added in 3.0. 
When different join strategy hints are specified on both sides of a join, Spark 
prioritizes hints in the following order: `BROADCAST` over `MERGE` over 
`SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with 
the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side 
based on the join type and the sizes of the relations. Since a given strategy 
may not support all join types, Spark is not guaranteed to use the join 
strategy suggested by the hint.
+Hints give users a way to suggest how Spark SQL to use specific approaches to 
generate its execution plan.
 
 ### Syntax
 
 ```sql
-/*+ join_hint [ , ... ] */
+/*+ hint [ , ... ] */
 ```
 
+### Partitioning Hints
+
+`COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities 
equivalent to those of the
+`Dataset` `coalesce`/`repartition`/`repartitionByRange` APIs. The `COALESCE` 
hint can be used to reduce

Review comment:
       How about moving the explanations for each hint (e.g., `The COALESCE 
hint can be used to reduce...`) into a new section like `### Partitioning Hints 
Types`?

##########
File path: docs/sql-ref-syntax-qry-select-hints.md
##########
@@ -21,14 +21,69 @@ license: |
 
 ### Description
 
-Join Hints allow users to suggest the join strategy that Spark should use. 
Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, 
`SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Joint Hints support was added in 3.0. 
When different join strategy hints are specified on both sides of a join, Spark 
prioritizes hints in the following order: `BROADCAST` over `MERGE` over 
`SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with 
the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side 
based on the join type and the sizes of the relations. Since a given strategy 
may not support all join types, Spark is not guaranteed to use the join 
strategy suggested by the hint.
+Hints give users a way to suggest how Spark SQL to use specific approaches to 
generate its execution plan.
 
 ### Syntax
 
 ```sql
-/*+ join_hint [ , ... ] */
+/*+ hint [ , ... ] */
 ```
 
+### Partitioning Hints
+
+`COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities 
equivalent to those of the

Review comment:
       How about rephrasing it like this?
   
   ---
   Partitioning hints allow users to suggest a partitioning strategy that Spark 
should follow. The `COALESCE`, `REPARTITION`, and `REPARTITION_BY_RANGE` hints are 
supported and are equivalent to the `coalesce`, `repartition`, and 
`repartitionByRange` Dataset APIs, respectively.
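
For readers following the thread, the three behaviours described in the diff can be modelled in a few lines of plain Python. This is only a conceptual sketch, not Spark's implementation: the function names `hash_partition`, `range_partition`, and `coalesce` are hypothetical, the keys are plain integers standing in for column `c`, the range bounds are taken from the full sorted data (Spark estimates them by sampling), and the merge order in `coalesce` is an arbitrary round-robin assumption.

```python
from bisect import bisect_left
from typing import List, Sequence


def hash_partition(keys: Sequence[int], num_partitions: int) -> List[List[int]]:
    """Toy model of REPARTITION(n, c): partition index is hash(key) mod n."""
    parts: List[List[int]] = [[] for _ in range(num_partitions)]
    for key in keys:
        parts[hash(key) % num_partitions].append(key)
    return parts


def range_partition(keys: Sequence[int], num_partitions: int) -> List[List[int]]:
    """Toy model of REPARTITION_BY_RANGE(n, c): each partition holds a
    contiguous range of key values."""
    parts: List[List[int]] = [[] for _ in range(num_partitions)]
    ordered = sorted(keys)
    if not ordered:
        return parts
    # Assumption: upper bounds come from the full sorted data, not a sample.
    step = len(ordered) / num_partitions
    bounds = [ordered[max(int(step * i) - 1, 0)] for i in range(1, num_partitions)]
    for key in keys:
        # Keys less than or equal to bounds[i] land in partition i or lower.
        parts[bisect_left(bounds, key)].append(key)
    return parts


def coalesce(parts: Sequence[Sequence[int]], num_partitions: int) -> List[List[int]]:
    """Toy model of COALESCE(n): merge whole partitions without moving
    individual rows; the partition count is never increased."""
    if num_partitions >= len(parts):
        return [list(p) for p in parts]
    merged: List[List[int]] = [[] for _ in range(num_partitions)]
    for i, part in enumerate(parts):
        # Assumption: round-robin merge order, chosen only for simplicity.
        merged[i % num_partitions].extend(part)
    return merged
```

With the keys `[7, 1, 4, 9, 2, 8, 3, 6, 5]` and three partitions, `hash_partition` scatters neighbouring values across partitions, while `range_partition` keeps 1 to 3, 4 to 6, and 7 to 9 together. This matches the `hashpartitioning` and `rangepartitioning` exchanges shown in the `EXPLAIN` output of the diff.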





[GitHub] [spark] maropu commented on a change in pull request #28672: [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference
