[
https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394196#comment-15394196
]
Apache Spark commented on SPARK-16736:
--------------------------------------
User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/14371
> remove redundant FileSystem status checks calls from Spark codebase
> -------------------------------------------------------------------
>
> Key: SPARK-16736
> URL: https://issues.apache.org/jira/browse/SPARK-16736
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Steve Loughran
> Priority: Minor
>
> The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are
> wrappers around {{FileSystem.getFileStatus()}}, which puts load on an HDFS
> NameNode (NN) and is very slow against object stores.
> # If these calls are followed by any {{getFileStatus()}} calls, they can be
> eliminated by merging the calls and moving the catching of
> {{FileNotFoundException}} out of the exists() call and into the Spark code.
> # Any sequence of exists + delete can be optimised by removing the exists
> check, relying on {{FileSystem.delete()}} to be a no-op if the destination
> path is not present. That's a tested requirement of all Hadoop-compatible
> filesystems and object stores.
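
The two rewrites proposed in the list above can be sketched roughly as follows. This is a minimal illustration, not code from the pull request: {{CountingStore}} and the four helper methods are hypothetical names, and the store is an in-memory stand-in that only mimics the contract described in the issue (a status call that throws {{FileNotFoundException}}, an exists() that wraps it, and a delete() that is a no-op on a missing path). It counts status lookups so the saving is visible.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a Hadoop-style filesystem client.
// It is NOT org.apache.hadoop.fs.FileSystem; it only mimics the contract
// described in the issue and counts how many status lookups are issued.
class StatusCheckSketch {

    static final class CountingStore {
        private final Map<String, Long> files = new HashMap<>(); // path -> length
        int statusCalls = 0; // each one stands for a NameNode RPC or object-store request

        void put(String path, long length) { files.put(path, length); }

        // Analogue of getFileStatus(): throws if the path is absent.
        long getFileStatus(String path) throws FileNotFoundException {
            statusCalls++;
            Long length = files.get(path);
            if (length == null) throw new FileNotFoundException(path);
            return length;
        }

        // Analogue of exists(): a wrapper that swallows the exception.
        boolean exists(String path) {
            try { getFileStatus(path); return true; }
            catch (FileNotFoundException e) { return false; }
        }

        // Analogue of delete(): returns false, without failing, if the path is absent.
        boolean delete(String path) { return files.remove(path) != null; }
    }

    // Before: exists() followed by getFileStatus() costs two lookups on a hit.
    static long lengthWithProbe(CountingStore fs, String path) {
        if (!fs.exists(path)) return -1L;
        try { return fs.getFileStatus(path); }
        catch (FileNotFoundException e) { return -1L; } // path vanished between the calls
    }

    // After: one lookup; the caller catches FileNotFoundException itself.
    static long lengthMerged(CountingStore fs, String path) {
        try { return fs.getFileStatus(path); }
        catch (FileNotFoundException e) { return -1L; }
    }

    // Before: exists() + delete() issues a status lookup before every delete.
    static boolean deleteWithProbe(CountingStore fs, String path) {
        return fs.exists(path) && fs.delete(path);
    }

    // After: delete() alone, relying on it being a no-op for a missing path.
    static boolean deleteDirect(CountingStore fs, String path) {
        return fs.delete(path);
    }

    public static void main(String[] args) {
        CountingStore fs = new CountingStore();
        fs.put("/data/part-0", 42L);
        lengthWithProbe(fs, "/data/part-0");
        System.out.println("probe + status lookups: " + fs.statusCalls); // prints 2
        fs.statusCalls = 0;
        lengthMerged(fs, "/data/part-0");
        System.out.println("merged lookups: " + fs.statusCalls); // prints 1
    }
}
```

In this model the merged forms return the same results as the probing forms while issuing one fewer status lookup per call. The merged read also avoids the window in which a path can disappear between the exists() check and the status call.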
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)