This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 4344a69  [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
4344a69 is described below

commit 4344a6927a894982eea2a891fc37fd8c739cc8f0
Author: Chao Sun <sunc...@apache.org>
AuthorDate: Fri Aug 21 16:48:54 2020 +0900

    [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc

    ### What changes were proposed in this pull request?

    This adds some tuning guidance for increasing the parallelism of directory listing.

    ### Why are the changes needed?

    Sometimes when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide so that the knowledge is better shared.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    N/A

    Closes #29498 from sunchao/SPARK-32674.

    Authored-by: Chao Sun <sunc...@apache.org>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
    (cherry picked from commit bf221debd02b11003b092201d0326302196e4ba5)
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 docs/sql-performance-tuning.md | 22 ++++++++++++++++++++++
 docs/tuning.md                 | 11 +++++++++++
 2 files changed, 33 insertions(+)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 7c7c4a8..f1726ce 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -90,6 +90,28 @@ that these options will be deprecated in future release as more optimizations ar
       Configures the number of partitions to use when shuffling data for joins or aggregations.
     </td>
   </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
+    <td>32</td>
+    <td>
+      Configures the threshold to enable parallel listing for job input paths. If the number of
+      input paths is larger than this threshold, Spark will list the files by using a Spark distributed job.
+      Otherwise, it will fall back to sequential listing. This configuration is only effective when
+      using file-based data sources such as Parquet, ORC and JSON.
+    </td>
+    <td>1.5.0</td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
+    <td>10000</td>
+    <td>
+      Configures the maximum listing parallelism for job input paths. If the number of input
+      paths is larger than this value, it will be throttled down to this value. As above,
+      this configuration is only effective when using file-based data sources such as Parquet, ORC
+      and JSON.
+    </td>
+    <td>2.1.1</td>
+  </tr>
 </table>

 ## Broadcast Hint for SQL Queries
diff --git a/docs/tuning.md b/docs/tuning.md
index cd0f9cd..a42ed79 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -249,6 +249,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.

+## Parallel Listing on Input Paths
+
+Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories,
+otherwise the process could take a very long time, especially when running against an object store like S3.
+If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
+controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (currently the default is 1).
+
+For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
+`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
+refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
+
 ## Memory Usage of Reduce Tasks

 Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
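As an illustration of the RDD-side knob referenced in the tuning.md change above, here is a minimal Scala sketch. It only shows the `spark.hadoop.` prefix route for passing the Hadoop property through `SparkConf`; the S3 path, app name, and thread count of 16 are hypothetical values, not part of this commit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallel-input-listing-sketch")  // hypothetical app name
  // spark.hadoop.* properties are copied into the Hadoop Configuration, so this sets
  // mapreduce.input.fileinputformat.list-status.num-threads for FileInputFormat listing.
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "16")

val sc = new SparkContext(conf)

// Listing the directories matched by the glob now uses up to 16 threads instead of 1.
val events = sc.sequenceFile[String, String]("s3a://some-bucket/events/*/*")  // hypothetical path
println(events.partitions.length)
```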
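Similarly, a sketch for the two Spark SQL options documented in the sql-performance-tuning.md hunk, showing where they plug into a `SparkSession`; the threshold of 16, the parallelism cap of 500, and the Parquet path are made-up example values.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-partition-discovery-sketch")  // hypothetical app name
  // List input paths with a distributed Spark job once more than 16 paths are involved
  // (the documented default threshold is 32).
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")
  // Cap the number of tasks used by that distributed listing job.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "500")
  .getOrCreate()

// For a dataset with many partition directories, file listing now runs as a Spark job
// instead of sequentially on the driver.
val df = spark.read.parquet("s3a://some-bucket/warehouse/events")  // hypothetical path
df.printSchema()
```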