This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 4344a69  [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
4344a69 is described below

commit 4344a6927a894982eea2a891fc37fd8c739cc8f0
Author: Chao Sun <sunc...@apache.org>
AuthorDate: Fri Aug 21 16:48:54 2020 +0900

    [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc

    ### What changes were proposed in this pull request?

    This adds some tuning guidance for increasing the parallelism of directory listing.

    ### Why are the changes needed?

    Sometimes when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide so that the knowledge is better shared.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    N/A

    Closes #29498 from sunchao/SPARK-32674.

    Authored-by: Chao Sun <sunc...@apache.org>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
    (cherry picked from commit bf221debd02b11003b092201d0326302196e4ba5)
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 docs/sql-performance-tuning.md | 22 ++++++++++++++++++++++
 docs/tuning.md                 | 11 +++++++++++
 2 files changed, 33 insertions(+)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 7c7c4a8..f1726ce 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -90,6 +90,28 @@ that these options will be deprecated in future release as more optimizations ar
       Configures the number of partitions to use when shuffling data for joins or aggregations.
     </td>
   </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
+    <td>32</td>
+    <td>
+      Configures the threshold to enable parallel listing for job input paths. If the number of
+      input paths is larger than this threshold, Spark will list the files by using a Spark distributed job.
+      Otherwise, it will fall back to sequential listing. This configuration is only effective when
+      using file-based data sources such as Parquet, ORC and JSON.
+    </td>
+    <td>1.5.0</td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
+    <td>10000</td>
+    <td>
+      Configures the maximum listing parallelism for job input paths. If the number of input
+      paths is larger than this value, it will be throttled down to this value. As above,
+      this configuration is only effective when using file-based data sources such as Parquet, ORC
+      and JSON.
+    </td>
+    <td>2.1.1</td>
+  </tr>
 </table>

 ## Broadcast Hint for SQL Queries
diff --git a/docs/tuning.md b/docs/tuning.md
index cd0f9cd..a42ed79 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -249,6 +249,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.

+## Parallel Listing on Input Paths
+
+Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories,
+otherwise the process could take a very long time, especially when running against an object store like S3.
+If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
+controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (currently the default is 1).
+
+For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
+`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
+refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
+
 ## Memory Usage of Reduce Tasks

 Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
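As an illustration of the RDD-side knob referenced in the tuning.md change above, here is a minimal Scala sketch. It only shows the `spark.hadoop.` prefix route for passing the Hadoop property through `SparkConf`; the S3 path, app name, and thread count of 16 are hypothetical values, not part of this commit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallel-input-listing-sketch")  // hypothetical app name
  // spark.hadoop.* properties are copied into the Hadoop Configuration, so this sets
  // mapreduce.input.fileinputformat.list-status.num-threads for FileInputFormat listing.
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "16")

val sc = new SparkContext(conf)

// Listing the directories matched by the glob now uses up to 16 threads instead of 1.
val events = sc.sequenceFile[String, String]("s3a://some-bucket/events/*/*")  // hypothetical path
println(events.partitions.length)
```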
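Similarly, a sketch for the two Spark SQL options documented in the sql-performance-tuning.md hunk, showing where they plug into a `SparkSession`; the threshold of 16, the parallelism cap of 500, and the Parquet path are made-up example values.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-partition-discovery-sketch")  // hypothetical app name
  // List input paths with a distributed Spark job once more than 16 paths are involved
  // (the documented default threshold is 32).
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")
  // Cap the number of tasks used by that distributed listing job.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "500")
  .getOrCreate()

// For a dataset with many partition directories, file listing now runs as a Spark job
// instead of sequentially on the driver.
val df = spark.read.parquet("s3a://some-bucket/warehouse/events")  // hypothetical path
df.printSchema()
```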