Github user chesterxgchen commented on the pull request:

    https://github.com/apache/spark/pull/4405#issuecomment-73636154
  
    @vanzin 
    
    Did you test against a secured (Kerberized) Hadoop cluster or just a normal cluster? If the Hadoop cluster is secured, I think these assumptions are required. I recently finished our Hadoop Kerberos authentication implementation for Pig, MapReduce, HDFS, Sqoop and Spark (for Spark, in YARN cluster mode). I don't think you can access a secured cluster without Kerberos authentication (assumption 1), and if UserGroupInformation uses SIMPLE mode against a secured Hadoop cluster, you will get an exception at some point (assumption 3).
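
    For concreteness, a tiny illustrative snippet (not code from this PR) of what assumption 3 boils down to: UserGroupInformation's mode is driven by `hadoop.security.authentication`, and a client left in SIMPLE mode eventually fails against a Kerberized cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Illustrative only: UGI's SIMPLE vs KERBEROS mode comes from the cluster
// configuration; a client left in SIMPLE mode gets an exception the first
// time it talks to a secured service.
val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)
assert(UserGroupInformation.isSecurityEnabled())
```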
    
    In our case, we did not use SparkSubmit, but used the YARN Client directly. I don't understand why standalone mode or Mesos mode wouldn't need a job delegation token; maybe you can elaborate on that a bit more.
    
    If you look at Oozie's implementation, you can see that before the MR job is submitted, the job delegation token is added to the JobClient's credentials. This is regardless of whether YARN is used or not.
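
    A rough sketch of that pattern using the plain Hadoop APIs (the renewer name "yarn" is only a placeholder, and this is not the code from this PR):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Oozie-style sketch: fetch HDFS delegation tokens before submitting the
// job and attach them to the credentials that travel with it.
val conf = new Configuration()
val credentials = new Credentials()
val fs = FileSystem.get(conf)
fs.addDelegationTokens("yarn", credentials) // renewer name is a placeholder

// Make the tokens visible to the submitting user's context so the launched
// tasks can reach HDFS without a Kerberos ticket of their own.
UserGroupInformation.getCurrentUser().addCredentials(credentials)
```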
    
    Another question relates to the overall approach. This seems fine for the command line calling SparkSubmit: the current user can be authenticated with kinit, and the proxy user can be impersonated via createProxyUser. The user who manages the Spark job submission is responsible for managing the Kerberos TGT lifetime, renewal, etc. If the ticket expires, the user can re-run kinit or use a cron job to keep it from expiring. In this case, Spark merely creates the proxy user.
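
    A minimal sketch of that command-line flow, assuming the submitting user already ran kinit (the proxy user name "alice" and the submission body are hypothetical placeholders):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// The real user is whoever holds the TGT from kinit; Spark only needs to
// wrap the actual submission in a doAs for the impersonated proxy user.
val realUser = UserGroupInformation.getCurrentUser()
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)

proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // submit the Spark application on behalf of "alice" here (placeholder)
  }
})
```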
    
    For an application (for example, a program that submits the Spark job directly, not from the command line), this approach doesn't seem to help much. The application can call createProxyUser in its own code instead of letting Spark do it, and it already does the Kerberos login (UserGroupInformation.loginUserFromKeytab), renews with UserGroupInformation.checkTGTAndReloginFromKeytab, handles ticket expiration, adds the job token, etc.
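
    As a hedged sketch, this is roughly what such an embedding application already handles on its own today (the principal and keytab path below are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val conf = new Configuration()
UserGroupInformation.setConfiguration(conf)

// Log in from a keytab instead of relying on an interactive kinit.
UserGroupInformation.loginUserFromKeytab(
  "app/[email protected]", "/etc/security/keytabs/app.keytab")

// Called periodically (e.g. from a background thread) to re-login
// before the TGT expires.
UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()
```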
    
    So is the approach only intended for command-line use? Does it make sense to push more of this logic into Spark, or does this logic not belong in Spark?
    
    
    thanks
    Chester Chen