GitHub user vanzin commented on the issue:
https://github.com/apache/spark/pull/17723
> Support for long running applications (which require token renewal, etc) was added much later in spark
That's different and not what this change is about. Support for Hadoop
security (i.e. delegation tokens) has existed at least since Spark 1.0 (I'm
hazy before that since that's when I started playing with Spark). And it
doesn't change the fact that it's the only custom security framework that
people have ever tried to use with Spark, as far as I'm aware.
Hadoop security is different and puts a lot of the burden on clients: they
have to log in via Kerberos, know up front which secure services they will
talk to, and fetch (and later renew) delegation tokens for each of them.
There are good reasons for that, but it means it's not as simple as just
providing a password. I wish there were a library that made all this easier
(wouldn't it be great if there was a single service to contact and ask for
"delegation token for service X", like the Kerberos TGS?), but that's not
the case.
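To make that burden concrete, here's a rough Scala sketch of the client-side
steps (the filesystem URI and renewer principal are placeholders for
illustration; `FileSystem.addDelegationTokens`, `Credentials` and
`UserGroupInformation` are the Hadoop APIs involved):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Rough sketch only: the URI and renewer principal below are placeholders.
object DelegationTokenSketch {
  def fetchTokens(): Credentials = {
    val hadoopConf = new Configuration()
    UserGroupInformation.setConfiguration(hadoopConf)

    // The client must already be logged in via Kerberos (e.g. kinit or a
    // keytab); delegation tokens can only be obtained with real credentials.
    val creds = new Credentials()

    // It must also know, up front, every secure service it will talk to,
    // and ask each of them for a token.
    val fs = FileSystem.get(new Path("hdfs://namenode:8020/").toUri, hadoopConf)
    fs.addDelegationTokens("yarn/renewer@EXAMPLE.COM", creds)

    // The tokens then have to be shipped to wherever the work runs and
    // renewed or re-obtained before they expire.
    creds
  }
}
```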
I can think of 3 other types of systems that Spark supports (directly or
through extensions):
- Those with no security, e.g. Kudu (as far as I know security is still on
its roadmap), Kafka 0.8, etc.
- Those that are happy with just a simple secret stashed somewhere, e.g.
S3, JDBC drivers, etc. (see the config sketch after this list). Even though
I've never seen it in practice, I also count cert-based authentication here,
since I'm pretty sure you can achieve it with existing features in Spark.
- Systems that implemented Kerberos-based auth but not Hadoop delegation
tokens. (Looking at you, Kafka 0.10.) That means it's really hard to use those
services in a distributed environment where security is enabled.
None of those require code in Spark to handle things specially (the third
would, but then you'd run into the good reasons why delegation tokens exist in
the first place, so they really should start using them instead).
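For contrast with the token machinery above, the second category really is
just configuration. A minimal sketch, assuming an s3a bucket and a JDBC
source (the host, table and environment variable names are placeholders;
`fs.s3a.access.key`/`fs.s3a.secret.key` and the JDBC `user`/`password`
options are the usual knobs):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: bucket, host, table and env var names are placeholders.
val spark = SparkSession.builder()
  .appName("simple-secret-sketch")
  // S3: a static secret in configuration is all the service wants.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// JDBC: same idea, the driver just takes a user name and password.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.some_table")
  .option("user", "spark_reader")
  .option("password", sys.env("DB_PASSWORD"))
  .load()
```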
> If we are not exposing an api for spark core, while maintaining backward compatibility
I guess I'm a little less queasy than you are about exposing an unstable
API. That's what the "Unstable" annotation means to me. It's an API that is
still being designed, and exposing it serves as both a way to let people write
extensions that fit the model and also collect feedback about things that don't
fit well.
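To illustrate, here's a sketch of what I mean by a marked-unstable extension
point. It's hypothetical: the trait and method names are made up for
illustration, not the interface this change adds, and `DeveloperApi` is just
a stand-in for whatever unstable marker gets used. The point is the
annotation, which tells implementors the contract may still change:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf
import org.apache.spark.annotation.DeveloperApi

/**
 * Hypothetical sketch only: not the interface this PR adds. Implementors
 * of an annotated API like this opt in knowing it may still change.
 */
@DeveloperApi
trait CredentialProviderSketch {
  /** Name of the external service this provider obtains credentials for. */
  def serviceName: String

  /** Fetch whatever tokens/secrets are needed and stash them in `creds`. */
  def obtainCredentials(sparkConf: SparkConf, hadoopConf: Configuration,
      creds: Credentials): Unit
}
```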
I have issues with the things you're suggesting, for a few different reasons:
- keeping the API private means people won't extend it, so we won't get
any feedback.
- moving the API to a separate module is a distinction without a
difference. It will still be a public Spark API, and still should follow the
rules of backwards compatibility. It would just increase coupling since it's
very unlikely that core wouldn't call into that module (since many people have
asked for Hadoop auth support in standalone too - I have some issues with the
security model there but it's a separate discussion).
- trying to work on an abstract interface to rule them all is a wild goose
chase. Unless you can point me to the contrary, we don't have an example of
what a different system would look like, so whatever abstract interface we end
up with will still be heavily modeled after the Hadoop system. Not exposing
Hadoop types is not a great gain if the whole mechanism still works like Hadoop
security. (It makes Spark's handling of backwards compatibility easier,
probably, but here the model is more important than the types exposed in the
API.)
Yes, exposing an unstable interface risks more work in the future. We may
have to change it, and then we'd have to decide whether to write code to keep
compatibility or break all users. It's a risk, but at the same time, we've
had a few years where this was the only model that needed supporting, so it
doesn't seem like that big of a risk to me.
If this elusive other system shows up in the future, we'll
probably have a lot more issues; we could try to merge both APIs or just handle
it with a completely separate one. In either case, there will be work. So I'm
really not seeing the benefit of going out of our way to mitigate future work.
It's better to do that work when we have a better idea of what it even looks
like.