[GitHub] [spark] sunchao commented on a change in pull request #29498: [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc

GitBox Thu, 20 Aug 2020 23:32:54 -0700


sunchao commented on a change in pull request #29498:
URL: https://github.com/apache/spark/pull/29498#discussion_r474435185




##########
File path: docs/tuning.md
##########
@@ -264,6 +264,13 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
 
+Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
+otherwise the process could take a very long time, especially when against object store like S3.

Review comment:
   It depends on how "remote" the storage is. With HDFS, depending on the use case, compute and storage can still be deployed within the same region, so the network/metadata cost is much lower than with S3.

   Therefore, I think we can stick with the S3 case, as it is more characteristic. Let me know if you think otherwise.
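
   To illustrate why listing parallelism matters when each listing call is slow, here is a minimal sketch in plain Python. It is not Spark's implementation: the helper names `list_dir` and `list_dirs_parallel` and the `num_threads` parameter are hypothetical, and it lists local directories, standing in for a remote store where every call pays a network round trip.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_dir(path):
    """Return the sorted full paths of regular files directly under `path`."""
    return sorted(entry.path for entry in os.scandir(path) if entry.is_file())


def list_dirs_parallel(paths, num_threads=8):
    """List many directories concurrently.

    Results are returned in the same order as `paths`. With a remote store,
    the threads overlap the per-call latency instead of paying it serially.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_dir = pool.map(list_dir, paths)  # map preserves input order
        return [f for files in per_dir for f in files]


if __name__ == "__main__":
    # Build a small hypothetical input layout: 4 directories, 3 files each.
    with tempfile.TemporaryDirectory() as root:
        dirs = []
        for i in range(4):
            d = os.path.join(root, f"part={i}")
            os.mkdir(d)
            for j in range(3):
                open(os.path.join(d, f"file{j}.txt"), "w").close()
            dirs.append(d)
        files = list_dirs_parallel(dirs, num_threads=4)
        print(len(files), "files found")
```

   On a local disk the speedup is negligible; the benefit grows with the per-call latency, which is why the object store case is the more characteristic one.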



