Github user chesterxgchen commented on the pull request:
https://github.com/apache/spark/pull/4405#issuecomment-73636154
@vanzin
Did you test with a secured Hadoop cluster or just a normal cluster?
If the Hadoop cluster is secured, I think these assumptions are required. I
just finished our Hadoop Kerberos authentication implementation with Pig,
MapReduce, HDFS, Sqoop and Spark recently (for Spark in Yarn cluster mode).
I don't think you can access the secured cluster without Kerberos authentication
(assumption 1). And if the UserGroupInformation uses the SIMPLE mode to access a
secured Hadoop cluster, you will get an exception at some point (assumption 3).
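For illustration, a rough sketch (Scala, against the Hadoop UserGroupInformation
API) of what that Kerberos login looks like before touching a secured cluster;
the principal and keytab path below are placeholders, not anything from this PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// The client config must say "kerberos"; with the default SIMPLE auth the
// NameNode/ResourceManager will reject the connection at some point.
val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)

// Placeholder principal and keytab path, for illustration only.
UserGroupInformation.loginUserFromKeytab(
  "chester@EXAMPLE.COM", "/etc/security/keytabs/chester.keytab")

// Only after this login can the secured HDFS be accessed.
val fs = FileSystem.get(conf)
fs.listStatus(new Path("/user/chester")).foreach(s => println(s.getPath))
```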
In our case we did not use SparkSubmit, but used the Yarn Client directly. I
don't understand why standalone mode or Mesos mode wouldn't need a job
delegation token? Maybe you can elaborate on that a bit more.
If you look at Oozie's implementation, you can see that before the MR job is
submitted, the delegation tokens are added to the JobClient's credentials. This
happens regardless of whether Yarn is used or not.
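Roughly what that token step amounts to (a sketch assuming HDFS tokens only;
the renewer name "yarn" is just a placeholder for the RM principal):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Ask the NameNode for HDFS delegation tokens and attach them to the
// credentials that travel with the submitted job.
val conf = new Configuration()
val creds = new Credentials()
FileSystem.get(conf).addDelegationTokens("yarn", creds)

// Make the tokens visible to whatever submits the job under this UGI.
UserGroupInformation.getCurrentUser.addCredentials(creds)
```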
Another question relates to the overall approach. This seems fine for the
command line calling SparkSubmit: the current user can be authenticated with
kinit, and the proxy user can be impersonated via createProxyUser. The user who
manages the Spark job submission is responsible for managing the Kerberos TGT
lifetime, renewal, etc. If the ticket expires, the user can re-run kinit or use
a cron job to keep it from expiring. In this case, Spark merely creates the
proxy user.
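A rough sketch of that kinit-plus-impersonation scenario (the proxy user name
"alice" is only a placeholder):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// The submitting user has already run kinit, so the login user carries a
// valid TGT; the submitter only wraps the work in the proxy user's UGI.
val realUser  = UserGroupInformation.getCurrentUser
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)

proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // launch the Yarn client / submit the Spark application as "alice" here
  }
})
```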
For an application (for example, a program that submits the Spark job directly,
not from the command line), this approach doesn't seem to help much, since the
application can call createProxyUser in its own program instead of letting
Spark do it; the application already does the Kerberos login
(UserGroupInformation.loginUserFromKeytab), renews via
UserGroupInformation.checkTGTAndReloginFromKeytab, handles ticket expiration,
adds the job tokens, etc.
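For example, such an application typically carries something like the following
itself (a sketch only; the principal, keytab path and renewal interval are
placeholders):

```scala
import org.apache.hadoop.security.UserGroupInformation

// Login once from a keytab, then periodically re-login before the TGT expires.
UserGroupInformation.loginUserFromKeytab(
  "app@EXAMPLE.COM", "/etc/security/keytabs/app.keytab")

val renewer = new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      // No-op if the TGT is still fresh; re-logs in from the keytab otherwise.
      UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
      Thread.sleep(60 * 60 * 1000L) // check roughly once an hour
    }
  }
})
renewer.setDaemon(true)
renewer.start()
```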
So is the approach only intended for command line use? Does it make sense to
push more logic into Spark? Or does this logic not belong in Spark?
thanks
Chester Chen