Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22746#discussion_r226250306
--- Diff: docs/sql-performance-turing.md ---
@@ -0,0 +1,151 @@
+---
+layout: global
+title: Performance Tuning
+displayTitle: Performance Tuning
+---
+
+* Table of contents
+{:toc}
+
+For some workloads, it is possible to improve performance by either caching data in memory, or by
+turning on some experimental options.
+
+## Caching Data In Memory
+
+Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
+Then Spark SQL will scan only required columns and will automatically tune compression to minimize
+memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory.
+
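+A minimal sketch of this workflow, assuming a running `SparkSession` named `spark` and an existing
+temporary view whose name, `records`, is only a placeholder for illustration, might look like:
+
+{% highlight scala %}
+// Cache the view in Spark SQL's in-memory columnar format.
+// "records" is a placeholder view name used only for this example.
+spark.catalog.cacheTable("records")
+
+// Subsequent queries read only the required columns from the cached data.
+spark.sql("SELECT COUNT(*) FROM records").show()
+
+// Drop the cached data once it is no longer needed.
+spark.catalog.uncacheTable("records")
+{% endhighlight %}
+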
+Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running
+`SET key=value` commands using SQL.
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+ <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
+ <td>true</td>
+ <td>
+    When set to true Spark SQL will automatically select a compression codec for each column based
+    on statistics of the data.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
+ <td>10000</td>
+ <td>
+    Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
+    and compression, but risk OOMs when caching data.
+ </td>
+</tr>
+
+</table>
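+
+As a rough illustration only (the values below are placeholders, not recommendations), these
+properties could be adjusted either programmatically on an existing `SparkSession` or through a
+SQL `SET` command:
+
+{% highlight scala %}
+// Programmatic form, using the session's runtime configuration.
+// 20000 is an arbitrary example value for the columnar batch size.
+spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", 20000L)
+
+// Equivalent SQL form.
+spark.sql("SET spark.sql.inMemoryColumnarStorage.compressed=true")
+{% endhighlight %}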
+
+## Other Configuration Options
+
+The following options can also be used to tune the performance of query execution. It is possible
+that these options will be deprecated in future release as more optimizations are performed automatically.
+
+<table class="table">
+ <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+ <tr>
+ <td><code>spark.sql.files.maxPartitionBytes</code></td>
+ <td>134217728 (128 MB)</td>
+ <td>
+      The maximum number of bytes to pack into a single partition when reading files.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.sql.files.openCostInBytes</code></td>
+ <td>4194304 (4 MB)</td>
+ <td>
+      The estimated cost to open a file, measured by the number of bytes could be scanned in the same
+      time. This is used when putting multiple files into a partition. It is better to over estimated,
--- End diff --
nit: `It is better to over estimated` -> ` It is better to over-estimate`?
---