Github user mgummelt commented on the issue:
https://github.com/apache/spark/pull/17723
Hey @vanzin @mridulm. Sorry for joining the party a bit late. I just read
through the discussion.
It seems the first point of contention is the distinction between Hadoop
and YARN. This PR relies on Hadoop libraries in `core`, but it shouldn't rely
on `yarn`. If it does, that's a mistake and I'll fix it.
Then the discussion becomes whether we should rely on Hadoop in `core`. It
looks like @mridulm acknowledges we're already using Hadoop in `core`, so I
hope we agree that this PR doesn't create a *new* problem, but that it does
increase the coupling.
I agree that, ideally, all Hadoop-specific code would be factored out into
a separate `hadoop` module. But, as @vanzin points out, doing so would be a
massive undertaking. Spark identity and access control are based on Hadoop
security (`UserGroupInformation`), and Hadoop filesystems are exposed through
`HadoopRDD`. We'd have to, at minimum, create an entirely new Spark access
control interface for which `UGI` is just one provider.
Also, as @vanzin points out, there's ultimately no way around using
Hadoop security libraries such as `UGI` if our goal is
to access Hadoop services. Hadoop services require Hadoop delegation tokens,
rather than some more broadly applicable security standard. And I hope we
agree we don't want to duplicate the `UGI` client code in both the `yarn` and
`mesos` module (sharing that client code was the whole motivation of this PR).
So the only alternative I see would be to create a separate `hadoop`
module, place all this code there, and create new interfaces that
Hadoop-specific code would implement. One obstacle to that is the sheer
amount of work. The other is that I'm not a huge fan of creating interfaces
when we have only one implementation, since you often end up with the wrong
interface and have to rewrite it anyway.
So my proposal is that we acknowledge that splitting out all Hadoop code
from core and placing it behind a common interface should be done at some
point, but we wait to do it until we at least have some other security provider
that would implement that interface.
BTW, I'll definitely go back and ensure that no Hadoop interfaces are
publicly exposed in `core`.