Repository: spark Updated Branches: refs/heads/branch-2.4 b9b594ade -> 4099565cd
[SPARK-24499][SQL][DOC][FOLLOW-UP] Fix spelling in doc ## What changes were proposed in this pull request? This PR replaces `turing` with `tuning` in files and a file name. Currently, in the left-side menu, `Turing` is shown. [This page](https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/_site/sql-performance-turing.html) is one example. ![image](https://user-images.githubusercontent.com/1315079/47332714-20a96180-d6bb-11e8-9a5a-0a8dad292626.png) ## How was this patch tested? `grep -rin turing docs` && `find docs -name "*turing*"` Closes #22800 from kiszk/SPARK-24499-follow. Authored-by: Kazuaki Ishizaki <ishiz...@jp.ibm.com> Signed-off-by: Wenchen Fan <wenc...@databricks.com> (cherry picked from commit c391dc65efb21357bdd80b28fba3851773759bc6) Signed-off-by: Wenchen Fan <wenc...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4099565c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4099565c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4099565c Branch: refs/heads/branch-2.4 Commit: 4099565cdddd887640b60e9c57d9dc7989e0c3ed Parents: b9b594a Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com> Authored: Tue Oct 23 12:19:31 2018 +0800 Committer: Wenchen Fan <wenc...@databricks.com> Committed: Tue Oct 23 12:20:00 2018 +0800 ---------------------------------------------------------------------- docs/_data/menu-sql.yaml | 10 +- docs/sql-migration-guide-upgrade.md | 2 +- docs/sql-performance-tuning.md | 151 +++++++++++++++++++++++++++++++ docs/sql-performance-turing.md | 151 ------------------------------- 4 files changed, 157 insertions(+), 157 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/_data/menu-sql.yaml ---------------------------------------------------------------------- diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index
6718763..cd065ea 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -36,15 +36,15 @@ url: sql-data-sources-avro.html - text: Troubleshooting url: sql-data-sources-troubleshooting.html -- text: Performance Turing - url: sql-performance-turing.html +- text: Performance Tuning + url: sql-performance-tuning.html subitems: - text: Caching Data In Memory - url: sql-performance-turing.html#caching-data-in-memory + url: sql-performance-tuning.html#caching-data-in-memory - text: Other Configuration Options - url: sql-performance-turing.html#other-configuration-options + url: sql-performance-tuning.html#other-configuration-options - text: Broadcast Hint for SQL Queries - url: sql-performance-turing.html#broadcast-hint-for-sql-queries + url: sql-performance-tuning.html#broadcast-hint-for-sql-queries - text: Distributed SQL Engine url: sql-distributed-sql-engine.html subitems: http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/sql-migration-guide-upgrade.md ---------------------------------------------------------------------- diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 062e07b..af561f2 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -270,7 +270,7 @@ displayTitle: Spark SQL Upgrading Guide - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame. - - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](sql-performance-turing.html#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). 
+ - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](sql-performance-tuning.html#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489). - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string regardless of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`. http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/sql-performance-tuning.md ---------------------------------------------------------------------- diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md new file mode 100644 index 0000000..7c7c4a8 --- /dev/null +++ b/docs/sql-performance-tuning.md @@ -0,0 +1,151 @@ +--- +layout: global +title: Performance Tuning +displayTitle: Performance Tuning +--- + +* Table of contents +{:toc} + +For some workloads, it is possible to improve performance by either caching data in memory, or by +turning on some experimental options. + +## Caching Data In Memory + +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`. +Then Spark SQL will scan only the required columns and will automatically tune compression to minimize +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory. + +Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running +`SET key=value` commands using SQL.
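+
+As a sketch of the SQL route (hedged; `tableName` is a placeholder table name, and the property values shown are simply the defaults), the caching settings can be applied and the cache populated entirely from SQL:
+
+{% highlight sql %}
+-- Tune the in-memory columnar cache before populating it
+SET spark.sql.inMemoryColumnarStorage.compressed=true;
+SET spark.sql.inMemoryColumnarStorage.batchSize=10000;
+
+-- CACHE TABLE is the SQL counterpart of spark.catalog.cacheTable("tableName");
+-- UNCACHE TABLE removes the table from memory again
+CACHE TABLE tableName;
+UNCACHE TABLE tableName;
+{% endhighlight %}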
+ +<table class="table"> +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> +<tr> + <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td> + <td>true</td> + <td> + When set to true, Spark SQL will automatically select a compression codec for each column based + on statistics of the data. + </td> +</tr> +<tr> + <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td> + <td>10000</td> + <td> + Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization + and compression, but risk OOMs when caching data. + </td> +</tr> + +</table> + +## Other Configuration Options + +The following options can also be used to tune the performance of query execution. It is possible +that these options will be deprecated in a future release as more optimizations are performed automatically. + +<table class="table"> + <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr> + <tr> + <td><code>spark.sql.files.maxPartitionBytes</code></td> + <td>134217728 (128 MB)</td> + <td> + The maximum number of bytes to pack into a single partition when reading files. + </td> + </tr> + <tr> + <td><code>spark.sql.files.openCostInBytes</code></td> + <td>4194304 (4 MB)</td> + <td> + The estimated cost to open a file, measured by the number of bytes that could be scanned in the same + time. This is used when putting multiple files into a partition. It is better to over-estimate; + then the partitions with small files will be faster than partitions with bigger files (which are + scheduled first). + </td> + </tr> + <tr> + <td><code>spark.sql.broadcastTimeout</code></td> + <td>300</td> + <td> + <p> + Timeout in seconds for the broadcast wait time in broadcast joins. + </p> + </td> + </tr> + <tr> + <td><code>spark.sql.autoBroadcastJoinThreshold</code></td> + <td>10485760 (10 MB)</td> + <td> + Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when + performing a join.
By setting this value to -1, broadcasting can be disabled. Note that currently + statistics are only supported for Hive Metastore tables where the command + <code>ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan</code> has been run. + </td> + </tr> + <tr> + <td><code>spark.sql.shuffle.partitions</code></td> + <td>200</td> + <td> + Configures the number of partitions to use when shuffling data for joins or aggregations. + </td> + </tr> +</table> + +## Broadcast Hint for SQL Queries + +The `BROADCAST` hint guides Spark to broadcast each specified table when joining it with another table or view. +When Spark decides the join method, the broadcast hash join (i.e., BHJ) is preferred, +even if the table's statistics are above the threshold `spark.sql.autoBroadcastJoinThreshold`. +When both sides of a join are specified, Spark broadcasts the one with the smaller statistics. +Note that Spark does not guarantee that BHJ is always chosen, since not all cases (e.g., full outer join) +support BHJ. When the broadcast nested loop join is selected, we still respect the hint.
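+
+As a hedged sketch of how the hint and the threshold interact (reusing the `records` and `src` tables from the examples in this page), one can disable size-based automatic broadcasting entirely and still force a broadcast through the hint:
+
+{% highlight sql %}
+-- -1 disables automatic, statistics-based broadcasting
+SET spark.sql.autoBroadcastJoinThreshold=-1;
+
+-- The hint still requests a broadcast of r: hints are honored
+-- even when the threshold alone would not select BHJ
+SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key;
+{% endhighlight %}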
+ +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> + +{% highlight scala %} +import org.apache.spark.sql.functions.broadcast +broadcast(spark.table("src")).join(spark.table("records"), "key").show() +{% endhighlight %} + +</div> + +<div data-lang="java" markdown="1"> + +{% highlight java %} +import static org.apache.spark.sql.functions.broadcast; +broadcast(spark.table("src")).join(spark.table("records"), "key").show(); +{% endhighlight %} + +</div> + +<div data-lang="python" markdown="1"> + +{% highlight python %} +from pyspark.sql.functions import broadcast +broadcast(spark.table("src")).join(spark.table("records"), "key").show() +{% endhighlight %} + +</div> + +<div data-lang="r" markdown="1"> + +{% highlight r %} +src <- sql("SELECT * FROM src") +records <- sql("SELECT * FROM records") +head(join(broadcast(src), records, src$key == records$key)) +{% endhighlight %} + +</div> + +<div data-lang="sql" markdown="1"> + +{% highlight sql %} +-- We accept BROADCAST, BROADCASTJOIN and MAPJOIN for broadcast hint +SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key +{% endhighlight %} + +</div> +</div> http://git-wip-us.apache.org/repos/asf/spark/blob/4099565c/docs/sql-performance-turing.md ---------------------------------------------------------------------- diff --git a/docs/sql-performance-turing.md b/docs/sql-performance-turing.md deleted file mode 100644 index 7c7c4a8..0000000 --- a/docs/sql-performance-turing.md +++ /dev/null @@ -1,151 +0,0 @@ -------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org