Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/17723
Hi @mgummelt, hopefully I have clarified some of my thinking above.
Responding to specific points below.
> It seems the first point of contention is the distinction between Hadoop
> and YARN. This PR relies on Hadoop libraries in core, but it shouldn't rely on
> yarn. If it is, that's a mistake and I should fix that.
My query was not regarding the actual credential provider implementations
themselves (which will require their dependencies), but whether spark core
needs to depend on the api.
Put another way, suppose we moved credential provider implementations into
a separate module - will spark core still need to depend on this or not ?
@vanzin made the point that since we depend on hadoop-client, which
depends on hadoop-security - this does not matter anymore :-)
> Then the discussion becomes whether we should rely on Hadoop in core. It
> looks like @mridulm acknowledges we're already using Hadoop in core, so I hope
> we agree that this PR doesn't create a new problem, but that it does increase
> the coupling.
Hopefully I covered this in my earlier comments.
I was not revisiting use of hadoop in core (that is pervasive in spark),
but whether hadoop-security model is sufficient for what we are attempting.
> And also as @vanzin points out, ultimately, there's no way to get around
> the requirement of using Hadoop security libraries such as UGI if our goal is
> to access Hadoop services. Hadoop services require Hadoop delegation tokens,
> rather than some more broadly applicable security standard. And I hope we agree
> we don't want to duplicate the UGI client code in both the yarn and mesos
> module (sharing that client code was the whole motivation of this PR).
Definitely agree on not duplicating code !
Paraphrasing what you mentioned earlier and elaborating, I am trying to
understand if:
* We can assume hadoop-security is sufficient for our usecases.
  * In this case, we leverage existing implementations as-is (more or less)
    and can expose hadoop-security in our interface definitions (`traits`,
    execution environment, how we use the credentials).
* hadoop-security becomes one supported model (and currently the only one).
  For example, the model definition could be:
  * Defining pre-requisites (principal/keytab, external credential update,
    etc).
  * Environment setup (`UGI.loginUserFromKeytabAndReturnUGI.doAs`, etc
    currently used).
  * Application of acquired credentials
    (`UGI.getCurrentUser.addCredentials`) at executors and driver.
  * Credential providers declaring which model they are for.
* Perhaps some other solution (synthesis of the above ? new ?)
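
To make the second option a bit more concrete, here is a rough sketch of what "hadoop-security as one supported model" might look like. Every name in it (`SecurityModel`, `CredentialProvider`, `CredentialManager`, the fake providers, the `principal`/`keytab` keys) is hypothetical and not taken from this PR or from Spark. The stand-in model does not call Hadoop at all; a real version would presumably wrap the UGI calls mentioned above.

```scala
// Hypothetical sketch: none of these traits or objects exist in Spark or Hadoop.

/** Acquired credentials keyed by service name. Strings stand in for token bytes. */
final case class Credentials(tokens: Map[String, String])

/** One supported security model, e.g. hadoop-security. */
trait SecurityModel {
  def name: String
  /** Pre-requisites: returns the required settings that are absent from `conf`. */
  def missingPrerequisites(conf: Map[String, String]): Seq[String]
  /** Environment setup: runs `body` inside the model's authenticated context. */
  def runAuthenticated[T](conf: Map[String, String])(body: => T): T
  /** Applies acquired credentials on the driver or on an executor. */
  def applyCredentials(creds: Credentials): Unit
}

/** A credential provider declares which model it is written for. */
trait CredentialProvider {
  def serviceName: String
  def modelName: String
  def obtain(conf: Map[String, String]): Credentials
}

/** Stand-in for the hadoop-security model; it only checks settings and records tokens. */
object HadoopSecurityModel extends SecurityModel {
  val name = "hadoop-security"
  private var applied = Map.empty[String, String]

  def missingPrerequisites(conf: Map[String, String]): Seq[String] =
    Seq("principal", "keytab").filterNot(conf.contains)

  def runAuthenticated[T](conf: Map[String, String])(body: => T): T = {
    val missing = missingPrerequisites(conf)
    require(missing.isEmpty, s"missing settings: ${missing.mkString(", ")}")
    // Assumption: a real model would log in from the keytab and run `body` in doAs here.
    body
  }

  // Assumption: a real model would add the tokens to the current user's credentials.
  def applyCredentials(creds: Credentials): Unit = applied = applied ++ creds.tokens

  def appliedTokens: Map[String, String] = applied
}

/** Fake provider targeting the hadoop-security model. */
object FakeHdfsProvider extends CredentialProvider {
  val serviceName = "hdfs"
  val modelName = "hadoop-security"
  def obtain(conf: Map[String, String]): Credentials =
    Credentials(Map(serviceName -> s"token-for-${conf("principal")}"))
}

/** Fake provider targeting some other, not yet existing, model. */
object FakeOtherProvider extends CredentialProvider {
  val serviceName = "other"
  val modelName = "some-other-model"
  def obtain(conf: Map[String, String]): Credentials =
    Credentials(Map(serviceName -> "opaque"))
}

object CredentialManager {
  /** Obtains credentials only from providers that declare the given model. */
  def obtainAll(
      model: SecurityModel,
      providers: Seq[CredentialProvider],
      conf: Map[String, String]): Credentials =
    model.runAuthenticated(conf) {
      val tokens = providers
        .filter(_.modelName == model.name)
        .flatMap(_.obtain(conf).tokens)
        .toMap
      Credentials(tokens)
    }
}
```

With something of this shape, a provider written for a different model is simply skipped when the hadoop-security model is active, and adding a second model would not require changing the provider trait.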
> So the only alternative I see would be to create a separate hadoop
> module, place all this code there, and create new interfaces that
> Hadoop-specific code would implement. One obstacle to that is the massive
> amount of work. The other is that I'm not a huge fan of creating interfaces
> when we only have one implementation, since you often end up with the wrong
> interface, so you have to rewrite it anyway.
I share your concerns here ! Creating an incorrect interface that we get
stuck with, a single implementation causing our interfaces to be very
specialized, and potentially over-designing are all risks.
I am trying to understand whether what we have is sufficient, or whether we
need to look more closely at our dependency on hadoop-security.
I am not very familiar with mesos or kubernetes, but a cursory search
indicated they have other forms of authentication ? If so, can you comment on
whether, with the current model, mesos will be able to evolve to support them ?
Since I do not have sufficiently compelling examples to give, I am unable to
convince @vanzin :-)
> So my proposal is that we acknowledge that decoupling all Hadoop code
> into a hadoop module and placing UGI behind a common interface should be done
> at some point, but we wait to do it until we at least have some other security
> provider that would implement that interface.
Since we are exposing credential providers as an api from core, we will
need to support it.
If it is possible to support other models in future without breaking our
exposed interfaces - that is something which would be an excellent way forward
too.