SHU WANG created SPARK-44272:
--------------------------------
Summary: Path Inconsistency when Operating statCache within Yarn Client
Key: SPARK-44272
URL: https://issues.apache.org/jira/browse/SPARK-44272
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 3.4.0, 2.3.0, 0.9.1, 3.5.0
Reporter: SHU WANG
The *addResource* method of *ClientDistributedCacheManager* adds a *FileStatus*
to *statCache* when the entry is not yet cached. However, there is a subtle bug
in *isPublic*, which is called from the *getVisibility* method:
*uri.getPath()* does not retain URI components such as the scheme and host, so
the *uri* passed to *checkPermissionOfOther* differs from the original *uri*.
For example, if uri is "file:/foo.invalid.com:8080/tmp/testing", then
{code:java}
uri.getPath -> /foo.invalid.com:8080/tmp/testing
uri.toString -> file:/foo.invalid.com:8080/tmp/testing{code}
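The difference can be reproduced with plain java.net.URI, without any Hadoop or Spark classes. The sketch below is only an illustration: the class StatCacheKeyDemo and the helper pathOnly are hypothetical names, and rebuilding a URI from the bare path is assumed to approximate what happens when the path string alone is carried forward.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class StatCacheKeyDemo {
    // Hypothetical helper: rebuilds a URI from the path component only,
    // which drops the scheme (and any authority) of the original URI.
    static URI pathOnly(URI uri) {
        return URI.create(uri.getPath());
    }

    public static void main(String[] args) {
        URI original = URI.create("file:/foo.invalid.com:8080/tmp/testing");
        URI stripped = pathOnly(original);

        // A map keyed by URI, standing in for statCache (values are placeholders).
        Map<URI, String> cache = new HashMap<>();
        cache.put(original, "cached-status");

        System.out.println(original.getPath());          // /foo.invalid.com:8080/tmp/testing
        System.out.println(original);                    // file:/foo.invalid.com:8080/tmp/testing
        System.out.println(original.equals(stripped));   // false
        System.out.println(cache.containsKey(stripped)); // false: lookup misses
    }
}
```

Because URI equality takes the scheme into account, the entry stored under the original URI is not found when the path-only URI is used as the key.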
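The consequence of this bug is that we *double the RPC calls* when the
resources are remote: the second lookup in *statCache* misses because its key
differs from the one that was cached, so the file status is fetched again
unnecessarily. We see nontrivial overhead when checking those resources from
our HDFS, especially when HDFS is overloaded.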
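The doubling can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Spark implementation: DoubleRpcDemo, fetchStatus, cachedStatus and rpcsForResource are hypothetical names, the host namenode.invalid:8020 is made up, and each call to fetchStatus stands in for one getFileStatus RPC.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class DoubleRpcDemo {
    // Counts simulated remote calls; each one stands in for a getFileStatus RPC.
    static int rpcCount = 0;

    // Hypothetical stand-in for fs.getFileStatus: returns a placeholder status.
    static String fetchStatus(URI uri) {
        rpcCount++;
        return "status-of-" + uri.getPath();
    }

    // Returns the cached status, fetching and storing it on a miss.
    static String cachedStatus(Map<URI, String> statCache, URI key) {
        String status = statCache.get(key);
        if (status == null) {
            status = fetchStatus(key);
            statCache.put(key, status);
        }
        return status;
    }

    // Simulates one addResource lookup followed by one visibility check
    // and returns how many simulated RPCs were issued.
    static int rpcsForResource(URI resource, boolean stripToPath) {
        rpcCount = 0;
        Map<URI, String> statCache = new HashMap<>();
        cachedStatus(statCache, resource); // first lookup, caches under the full URI
        URI visibilityKey = stripToPath ? URI.create(resource.getPath()) : resource;
        cachedStatus(statCache, visibilityKey); // second lookup
        return rpcCount;
    }

    public static void main(String[] args) {
        URI resource = URI.create("hdfs://namenode.invalid:8020/tmp/testing");
        System.out.println(rpcsForResource(resource, true));  // 2: path-only key misses
        System.out.println(rpcsForResource(resource, false)); // 1: same key hits
    }
}
```

With the path-only key the second lookup cannot reuse the cached entry, so every remote resource costs two status fetches instead of one.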
Ref: related code within *ClientDistributedCacheManager*
{code:java}
def addResource(...) {
  val destStatus = statCache.getOrElseUpdate(destPath.toUri(), fs.getFileStatus(destPath))
  val visibility = getVisibility(conf, destPath.toUri(), statCache)
}

private[yarn] def getVisibility() {
  isPublic(conf, uri, statCache)
}

private def isPublic(conf: Configuration, uri: URI, statCache: Map[URI, FileStatus]): Boolean = {
  val current = new Path(uri.getPath()) // Should not use getPath
  checkPermissionOfOther(fs, uri, FsAction.READ, statCache)
}
{code}