yqwang-ms edited a comment on pull request #28323: URL: https://github.com/apache/spark/pull/28323#issuecomment-620338444
For testing, so far I have done:

1. An end-to-end simple Spark job test (e.g. Spark word count) on a secure Hadoop cluster.
2. All existing unit tests pass.

We can also do a little reasoning to **show that this change does not introduce more issues than current Spark** (given that the code change is small).

In current Spark: https://github.com/apache/spark/blob/410fa913215665db72f871b556569cba3dc9ee0a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L66-L76

The **transferCredentials** function can only transfer Hadoop credentials such as delegation tokens. Other credentials stored in `UGI.subject.getPrivateCredentials` are lost here, such as:
- Non-Hadoop credentials, e.g. [Kafka creds](https://github.com/apache/kafka/blob/f3c8bff311b0e4c4d0e316ac949fe4491f9b107f/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/OAuthBearerLoginModule.java#L395)
- Newly supported or third-party Hadoop credentials: for example, to support OAuth/JWT token authentication on Hadoop, we need to store the OAuth/JWT token into `UGI.subject.getPrivateCredentials`. However, these tokens are not supposed to be managed by Hadoop `Credentials` (currently it is only for Hadoop secret keys and delegation tokens).

BTW, if we use SPARK_USER as the effective user and `UserGroupInformation.getCurrentUser` as the real user (to impersonate the effective user), we should use **createProxyUser** instead of directly calling transferCredentials from `UserGroupInformation.getCurrentUser` to SPARK_USER. This is because the transferred credentials may not match SPARK_USER (e.g. the user name does not match, so the server may choose to reject the credentials). You can also check other places in Spark and Hadoop: they all use createProxyUser instead of hacks like transferCredentials.
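To make the credential-loss argument concrete, here is a minimal, self-contained sketch. This is plain Java with made-up names (`ToyUgi`, `ToyCredentials`), not the real Hadoop classes: a transferCredentials-style copy only moves the Hadoop tokens, while a createProxyUser-style wrapping keeps the real user's whole private-credential set visible.

```java
import java.util.HashSet;
import java.util.Set;

public class ProxyVsTransfer {
    // Stand-in for org.apache.hadoop.security.Credentials (tokens/secret keys only).
    static class ToyCredentials { final Set<String> tokens = new HashSet<>(); }

    // Stand-in for a UGI: its subject holds arbitrary private credentials,
    // one of which is the Hadoop Credentials object.
    static class ToyUgi {
        final Set<Object> privateCredentials = new HashSet<>();
        final ToyCredentials hadoopCreds = new ToyCredentials();
        ToyUgi() { privateCredentials.add(hadoopCreds); }
    }

    // transferCredentials-style copy: only the Hadoop tokens move.
    static void transferCredentials(ToyUgi source, ToyUgi dest) {
        dest.hadoopCreds.tokens.addAll(source.hadoopCreds.tokens);
    }

    // createProxyUser-style wrapping: the proxy sees the real user's
    // entire private-credential set, so nothing is lost.
    static ToyUgi createProxyUser(ToyUgi realUser) {
        ToyUgi proxy = new ToyUgi();
        proxy.privateCredentials.addAll(realUser.privateCredentials);
        return proxy;
    }

    public static void main(String[] args) {
        ToyUgi realUser = new ToyUgi();
        realUser.hadoopCreds.tokens.add("hdfs-delegation-token");
        realUser.privateCredentials.add("kafka-oauth-bearer-token"); // non-Hadoop cred

        ToyUgi viaTransfer = new ToyUgi();
        transferCredentials(realUser, viaTransfer);
        // The non-Hadoop credential is lost with transfer-style copying:
        System.out.println(viaTransfer.privateCredentials.contains("kafka-oauth-bearer-token")); // false

        ToyUgi viaProxy = createProxyUser(realUser);
        // The proxy still sees every private credential of the real user:
        System.out.println(viaProxy.privateCredentials.contains("kafka-oauth-bearer-token")); // true
    }
}
```

The real `UserGroupInformation.createProxyUser(user, realUser)` achieves this by reusing the real user's subject for authentication, rather than copying a subset of credentials into a new one.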
What this change does is simply replace transferCredentials with createProxyUser, so the credentials match the real user; besides, it does not lose any credentials stored in `UGI.subject.getPrivateCredentials` (including the credentials transferred by transferCredentials). So it can **only increase** the possibility that a target RPC server accepts our UGI, i.e. successful authentication. **So it has no impact on Spark jobs that currently work well.** For more, see createProxyUser: https://github.com/apache/hadoop/blob/5e0eda5d5f696aba7fc209874d232baf2a50d547/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L1442-L1451

For https://github.com/apache/spark/blob/410fa913215665db72f871b556569cba3dc9ee0a/core/src/main/scala/org/apache/spark/util/Utils.scala#L2410-L2413 , I agree it is not safe to change it to use getUserName, so I kept it using getShortUserName; please check the change.

As for https://github.com/apache/spark/blob/410fa913215665db72f871b556569cba3dc9ee0a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L844-L849 , it is unrelated to the current credential-loss issue, but it should be a bug in Spark, because the RM principal should be set to the fully qualified name, i.e. getUserName instead of getShortUserName. Please check the related Hadoop code and usage. Again, a fully qualified name will **only increase** successful authentication. However, if you think it is not suitable or safe for the current PR, I can remove it :)

The only change that may impact existing Spark jobs is this: someone who previously set SPARK_USER to a fully qualified name (such as yqwang-ms/y...@zz.com) will now get the short user name from getCurrentUserName. However, in previous Spark, if no SPARK_USER was provided, we also returned `UserGroupInformation.getCurrentUser().getShortUserName()`. So let's align getCurrentUserName to return the short user name; I think this is an inevitable change and impact.
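On the short-vs-fully-qualified name point: Hadoop derives the short user name from a Kerberos principal of the form `primary[/instance][@REALM]` by keeping only the primary component (the real `getShortUserName` applies configurable `auth_to_local` rules; the default rule behaves like this for local-realm principals). A hedged sketch of that default rule, with a hypothetical helper name:

```java
public class ShortName {
    // Hypothetical simplification of Hadoop's default short-name mapping:
    // strip the "@REALM" suffix, then strip the "/instance" component.
    // Real Hadoop applies configurable hadoop.security.auth_to_local rules.
    static String shortUserName(String principal) {
        int at = principal.indexOf('@');
        String withoutRealm = (at >= 0) ? principal.substring(0, at) : principal;
        int slash = withoutRealm.indexOf('/');
        return (slash >= 0) ? withoutRealm.substring(0, slash) : withoutRealm;
    }

    public static void main(String[] args) {
        System.out.println(shortUserName("alice/host.example.com@EXAMPLE.COM")); // alice
        System.out.println(shortUserName("alice@EXAMPLE.COM"));                  // alice
        System.out.println(shortUserName("alice"));                              // alice
    }
}
```

This is why a user who exported SPARK_USER as a fully qualified principal will observe a different (shorter) name once getCurrentUserName is aligned to the short form.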
---------------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org