Repository: spark Updated Branches: refs/heads/branch-2.4 b9b594ade -> 4099565cd
[SPARK-24499][SQL][DOC][FOLLOW-UP] Fix spelling in doc ## What changes were proposed in this pull request? This PR replaces `turing` with `tuning` in files and a file name. Currently, in the left-side menu, `Turing` is shown. [This page](https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/_site/sql-performance-turing.html) is one example. ![image](https://user-images.githubusercontent.com/1315079/47332714-20a96180-d6bb-11e8-9a5a-0a8dad292626.png) ## How was this patch tested? `grep -rin turing docs` && `find docs -name "*turing*"` Closes #22800 from kiszk/SPARK-24499-follow. Authored-by: Kazuaki Ishizaki <ishiz...@jp.ibm.com> Signed-off-by: Wenchen Fan <wenc...@databricks.com> (cherry picked from commit c391dc65efb21357bdd80b28fba3851773759bc6) Signed-off-by: Wenchen Fan <wenc...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4099565c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4099565c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4099565c Branch: refs/heads/branch-2.4 Commit: 4099565cdddd887640b60e9c57d9dc7989e0c3ed Parents: b9b594a Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com> Authored: Tue Oct 23 12:19:31 2018 +0800 Committer: Wenchen Fan <wenc...@databricks.com> Committed: Tue Oct 23 12:20:00 2018 +0800 ---------------------------------------------------------------------- docs/_data/menu-sql.yaml | 10 +- docs/sql-migration-guide-upgrade.md | 2 +- docs/sql-performance-tuning.md | 151 +++++++++++++++++++++++++++++++ docs/sql-performance-turing.md | 151 ------------------------------- 4 files changed, 157 insertions(+), 157 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/_data/menu-sql.yaml ---------------------------------------------------------------------- diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index
6718763..cd065ea 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -36,15 +36,15 @@ url: sql-data-sources-avro.html - text: Troubleshooting url: sql-data-sources-troubleshooting.html -- text: Performance Turing - url: sql-performance-turing.html +- text: Performance Tuning + url: sql-performance-tuning.html subitems: - text: Caching Data In Memory - url: sql-performance-turing.html#caching-data-in-memory + url: sql-performance-tuning.html#caching-data-in-memory - text: Other Configuration Options - url: sql-performance-turing.html#other-configuration-options + url: sql-performance-tuning.html#other-configuration-options - text: Broadcast Hint for SQL Queries - url: sql-performance-turing.html#broadcast-hint-for-sql-queries + url: sql-performance-tuning.html#broadcast-hint-for-sql-queries - text: Distributed SQL Engine url: sql-distributed-sql-engine.html subitems: http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/sql-migration-guide-upgrade.md ---------------------------------------------------------------------- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 062e07b..af561f2 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -270,7 +270,7 @@ displayTitle: Spark SQL Upgrading Guide - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame. - - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](sql-performance-turing.html#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). 
+ - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](sql-performance-tuning.html#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string regardless of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`. http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/sql-performance-tuning.md ---------------------------------------------------------------------- diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md new file mode 100644 index 0000000..7c7c4a8 --- /dev/null +++ b/docs/sql-performance-tuning.md @@ -0,0 +1,151 @@ +--- +layout: global +title: Performance Tuning +displayTitle: Performance Tuning +--- + +* Table of contents +{:toc} + +For some workloads, it is possible to improve performance by either caching data in memory, or by +turning on some experimental options. + +## Caching Data In Memory + +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`. +Then Spark SQL will scan only the required columns and will automatically tune compression to minimize +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory. + +Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running +`SET key=value` commands using SQL.
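+
+As a sketch of the SQL route (hedged; `tableName` is a placeholder table name, and the property values shown are simply the defaults), the caching settings can be applied and the cache populated entirely from SQL:
+
+{% highlight sql %}
+-- Tune the in-memory columnar cache before populating it
+SET spark.sql.inMemoryColumnarStorage.compressed=true;
+SET spark.sql.inMemoryColumnarStorage.batchSize=10000;
+
+-- CACHE TABLE is the SQL counterpart of spark.catalog.cacheTable("tableName");
+-- UNCACHE TABLE removes the table from memory again
+CACHE TABLE tableName;
+UNCACHE TABLE tableName;
+{% endhighlight %}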
+ +<table class="table"> +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> +<tr> + <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td> + <td>true</td> + <td> + When set to true, Spark SQL will automatically select a compression codec for each column based + on statistics of the data. + </td> +</tr> +<tr> + <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td> + <td>10000</td> + <td> + Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization + and compression, but risk OOMs when caching data. + </td> +</tr> + +</table> + +## Other Configuration Options + +The following options can also be used to tune the performance of query execution. It is possible +that these options will be deprecated in a future release as more optimizations are performed automatically. + +<table class="table"> + <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> + <tr> + <td><code>spark.sql.files.maxPartitionBytes</code></td> + <td>134217728 (128 MB)</td> + <td> + The maximum number of bytes to pack into a single partition when reading files. + </td> + </tr> + <tr> + <td><code>spark.sql.files.openCostInBytes</code></td> + <td>4194304 (4 MB)</td> + <td> + The estimated cost to open a file, measured by the number of bytes that could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over-estimate; + then the partitions with small files will be faster than partitions with bigger files (which are + scheduled first). + </td> + </tr> + <tr> + <td><code>spark.sql.broadcastTimeout</code></td> + <td>300</td> + <td> + <p> + Timeout in seconds for the broadcast wait time in broadcast joins. + </p> + </td> + </tr> + <tr> + <td><code>spark.sql.autoBroadcastJoinThreshold</code></td> + <td>10485760 (10 MB)</td> + <td> + Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when + performing a join.
By setting this value to -1, broadcasting can be disabled. Note that currently + statistics are only supported for Hive Metastore tables where the command + <code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run. + </td> + </tr> + <tr> + <td><code>spark.sql.shuffle.partitions</code></td> + <td>200</td> + <td> + Configures the number of partitions to use when shuffling data for joins or aggregations. + </td> + </tr> +</table> + +## Broadcast Hint for SQL Queries + +The `BROADCAST` hint guides Spark to broadcast each specified table when joining it with another table or view. +When Spark decides the join method, the broadcast hash join (i.e., BHJ) is preferred, +even if the table's statistics are above the threshold `spark.sql.autoBroadcastJoinThreshold`. +When both sides of a join are specified, Spark broadcasts the one with the smaller statistics. +Note that Spark does not guarantee that BHJ is always chosen, since not all cases (e.g., full outer join) +support BHJ. When the broadcast nested loop join is selected, we still respect the hint.
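+
+As a hedged sketch of how the hint and the threshold interact (reusing the `records` and `src` tables from the examples in this page), one can disable size-based automatic broadcasting entirely and still force a broadcast through the hint:
+
+{% highlight sql %}
+-- -1 disables automatic, statistics-based broadcasting
+SET spark.sql.autoBroadcastJoinThreshold=-1;
+
+-- The hint still requests a broadcast of r: hints are honored
+-- even when the threshold alone would not select BHJ
+SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key;
+{% endhighlight %}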
+ +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> + +{% highlight scala %} +import org.apache.spark.sql.functions.broadcast +broadcast(spark.table("src")).join(spark.table("records"), "key").show() +{% endhighlight %} + +</div> + +<div data-lang="java" markdown="1"> + +{% highlight java %} +import static org.apache.spark.sql.functions.broadcast; +broadcast(spark.table("src")).join(spark.table("records"), "key").show(); +{% endhighlight %} + +</div> + +<div data-lang="python" markdown="1"> + +{% highlight python %} +from pyspark.sql.functions import broadcast +broadcast(spark.table("src")).join(spark.table("records"), "key").show() +{% endhighlight %} + +</div> + +<div data-lang="r" markdown="1"> + +{% highlight r %} +src <- sql("SELECT * FROM src") +records <- sql("SELECT * FROM records") +head(join(broadcast(src), records, src$key == records$key)) +{% endhighlight %} + +</div> + +<div data-lang="sql" markdown="1"> + +{% highlight sql %} +-- We accept BROADCAST, BROADCASTJOIN and MAPJOIN for broadcast hint +SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key +{% endhighlight %} + +</div> +</div> http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/sql-performance-turing.md ---------------------------------------------------------------------- diff --git a/docs/sql-performance-turing.md b/docs/sql-performance-turing.md deleted file mode 100644 index 7c7c4a8..0000000 --- a/docs/sql-performance-turing.md +++ /dev/null @@ -1,151 +0,0 @@ -------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org