[GitHub] [spark] viirya commented on a change in pull request #29498: [SPARK-32674][DOC] Add suggestion for parallel directory listing in t…

GitBox Thu, 20 Aug 2020 16:44:46 -0700


viirya commented on a change in pull request #29498:
URL: https://github.com/apache/spark/pull/29498#discussion_r474330233




##########
File path: docs/tuning.md
##########
@@ -264,6 +264,13 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
 
+Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
+otherwise the process could take a very long time, especially when against object store like S3.
+If your job works on RDD with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
+controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other

Review comment:
       This setting seems to have a limitation: multiple threads should not be used with a custom path filter that is not thread-safe.
   
   
https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
   
   > The number of threads to use to list and fetch block locations for the specified input paths. Note: multiple threads should not be used if a custom non thread-safe path filter is used.
   
   Should we mention this limitation here as well?
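To make the concern concrete, here is a minimal plain-Python sketch of a stateful filter being called from several listing threads. It is only an analogy: `CountingSuffixFilter` and `list_with_filter` are hypothetical names, and nothing here is Hadoop's `PathFilter` API or Spark's listing code. The point is that a filter holding mutable state needs its own synchronization once the listing thread count goes above 1.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingSuffixFilter:
    """Hypothetical stateful path filter, made thread-safe with a lock."""

    def __init__(self, suffix):
        self.suffix = suffix
        self.accepted = 0              # mutable state shared across threads
        self._lock = threading.Lock()

    def accept(self, path):
        keep = path.endswith(self.suffix)
        if keep:
            # Without this lock, concurrent increments could be lost.
            with self._lock:
                self.accepted += 1
        return keep


def list_with_filter(paths, path_filter, num_threads):
    """Apply the filter to every path using a pool of num_threads workers."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so flags line up with paths.
        flags = list(pool.map(path_filter.accept, paths))
    return [p for p, keep in zip(paths, flags) if keep]


if __name__ == "__main__":
    # Made-up input paths, for illustration only.
    paths = [f"/data/dir{i}/part-{i}.seq" for i in range(50)]
    paths += [f"/data/dir{i}/_SUCCESS" for i in range(50)]
    flt = CountingSuffixFilter(".seq")
    kept = list_with_filter(paths, flt, num_threads=4)
    print(len(kept), flt.accepted)
```

A filter with no shared mutable state (for example, a pure suffix check) would need no lock at all, which is presumably the case the Hadoop note treats as safe for multiple threads.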




