PrabhuJoseph commented on pull request #33674:
URL: https://github.com/apache/spark/pull/33674#issuecomment-989729420
@Shockang I have tested the patch, but it does not address the reported issue.
The Spark job runs in yarn-client mode, where the Driver runs alongside the
Client. The Driver creates a new JobConf in HadoopRDD for every partition, which
internally fetches a FileSystem Delegation Token. So with 1000 partitions, a
delegation token is fetched 1000 times.
The Spark Client gets the FileSystem Delegation Token at startup (Client.scala
- setupSecurityToken), places it in the token file, and passes it to the Spark
Application Master and Executors to use. But the Client itself uses different
credentials, which do not contain the FileSystem Delegation Token because it
authenticates with the TGT. (Refer to SPARK-15754.)
So every call the Driver (client mode) makes to list a path creates a
separate JobConf and adds the Client credentials, which do not contain the
FileSystem token, and so a new token is obtained.
One simple fix is to expose a config that, if enabled, adds the obtained Hadoop
FileSystem delegation tokens to the client user credentials. This improves
performance by fetching the delegation token only once when running a query on
a partitioned table.
Client.scala
   private val hadoopConf = new YarnConfiguration(SparkHadoopUtil.newConfiguration(sparkConf))
   private val isClusterMode = sparkConf.get("spark.submit.deployMode", "client") == "cluster"
+  private val useDelegationToken = sparkConf.getBoolean("spark.client.useDelegationToken", false)
   // AM related configurations
   private val amMemory = if (isClusterMode) {

     // and adding delegation tokens could lead to expired or cancelled tokens being used
     // later, as reported in SPARK-15754.
     val currentUser = UserGroupInformation.getCurrentUser()
-    if (SparkHadoopUtil.get.isProxyUser(currentUser)) {
+    if (SparkHadoopUtil.get.isProxyUser(currentUser) || useDelegationToken) {
+      logInfo("Adding obtained Hadoop Delegation Tokens into User Credentials")
       currentUser.addCredentials(credentials)
     }
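
To illustrate why the change helps, here is a toy model of the behaviour described above. It is only a sketch and does not use the Hadoop or Spark APIs: `ToyCredentials`, `ToyNameNode`, `listPartition` and `run` are hypothetical names made up for this example, and the assumption is that each per-partition JobConf copies the user credentials and fetches a token only when none is present for the file system.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Hadoop's Credentials: service name -> token.
final class ToyCredentials {
  private val tokens = mutable.Map.empty[String, String]
  def getToken(service: String): Option[String] = tokens.get(service)
  def addToken(service: String, token: String): Unit = tokens(service) = token
  def addAll(other: ToyCredentials): Unit =
    other.tokens.foreach { case (s, t) => tokens(s) = t }
}

// Hypothetical stand-in for the NameNode: counts the tokens it issues.
final class ToyNameNode(val service: String) {
  var issued = 0
  def issueToken(): String = { issued += 1; s"token-$issued" }
}

object TokenFetchModel {
  // Assumed per-partition behaviour: a fresh credential set is seeded from
  // the user credentials, and a token is fetched only if none is present.
  def listPartition(user: ToyCredentials, nn: ToyNameNode): Unit = {
    val jobCreds = new ToyCredentials
    jobCreds.addAll(user)
    if (jobCreds.getToken(nn.service).isEmpty) {
      jobCreds.addToken(nn.service, nn.issueToken())
    }
  }

  // Returns the total number of tokens the toy NameNode issued.
  def run(partitions: Int, useDelegationToken: Boolean): Int = {
    val nn = new ToyNameNode("hdfs://nn")
    val user = new ToyCredentials
    // At startup the client obtains one token for the AM and Executors.
    val obtained = new ToyCredentials
    obtained.addToken(nn.service, nn.issueToken())
    // The proposed flag also adds that token to the user credentials.
    if (useDelegationToken) user.addAll(obtained)
    (1 to partitions).foreach(_ => listPartition(user, nn))
    nn.issued
  }

  def main(args: Array[String]): Unit = {
    println(run(1000, useDelegationToken = false)) // prints 1001
    println(run(1000, useDelegationToken = true))  // prints 1
  }
}
```

In this model the flag only changes whether the token obtained at startup is visible to later calls. It says nothing about token expiry or cancellation, which is the concern raised in SPARK-15754.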
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]