Kousuke Saruta created SPARK-57703:
--------------------------------------
Summary: OIDC Credential Propagation
Key: SPARK-57703
URL: https://issues.apache.org/jira/browse/SPARK-57703
Project: Spark
Issue Type: Umbrella
Components: Spark Core
Affects Versions: 4.3.0
Reporter: Kousuke Saruta
SPIP docs:
[https://docs.google.com/document/d/1usJKncCPMiyFUg7aIdpZ0HQsklXIHow_sU_6dfFMjN0/edit?tab=t.0#heading=h.fkw9v2um6wxa|https://docs.google.com/document/d/1usJKncCPMiyFUg7aIdpZ0HQsklXIHow_sU_6dfFMjN0/edit?usp=sharing]
–
*Q1. What are you trying to do?*
Add a mechanism for Spark on Kubernetes to propagate OIDC-based credentials
from the driver to executors, enabling per-workload (and eventually per-user)
access control on cloud storage (S3, ADLS, GCS) and S3-compatible systems
(MinIO, Ceph). This is the Kubernetes + OIDC equivalent of what Kerberos +
delegation tokens provide on YARN + HDFS.
*Q2. What problem is this proposal NOT designed to solve?*
- Spark Connect authentication (separate concern, future work)
- Azure / GCP provider implementations (future, same SPI)
- Hive Metastore / catalog OIDC integration (future)
- Modifying Hadoop UserGroupInformation
- Per-task multi-user credential scoping
*Q3. How is it done today, and what are the limits of current practice?*
Today, all Spark jobs on Kubernetes access cloud storage as the pod's service
account (IRSA / Pod Identity). There is no Spark-level mechanism to carry an
OIDC token through the driver to executors and convert it into per-workload
storage credentials. Workarounds are: one cluster per user (expensive), shared
over-privileged service account (insecure), or custom credential provider
implementations (unmaintainable).
*Q4. What is new in your approach and why do you think it will be successful?*
We introduce a `CredentialProvider` SPI that generalizes OAuth 2.0 Token
Exchange (RFC 8693). The driver reads a projected ServiceAccount token (OIDC
JWT), calls the SPI to obtain temporary storage credentials, and distributes
them to executors via a new `UpdateUserCredentials` RPC, mirroring the
existing `HadoopDelegationTokenManager` / `UpdateDelegationTokens` pattern.
We expect this approach to succeed because it follows an established Spark
pattern, coexists with Kerberos, and requires no changes to user application
code.
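As an illustration of the last step, the sketch below shows one way temporary credentials could be packed into the `Array[Byte]` payload carried by `UpdateUserCredentials` and unpacked on an executor. The SPIP does not specify a wire format; `PayloadSketch`, `Credential`, `encode`, and `decode` are hypothetical names, and the length-prefixed layout is an assumption made only for this example.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical sketch: the actual payload format is not defined in the SPIP.
object PayloadSketch {
  // Stand-in for the proposed ServiceCredential (properties plus expiry).
  final case class Credential(properties: Map[String, String], expiresAtEpochMs: Long)

  // Driver side: expiry, entry count, then key/value pairs.
  // Note: writeUTF limits each string to 65535 encoded bytes.
  def encode(c: Credential): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeLong(c.expiresAtEpochMs)
    out.writeInt(c.properties.size)
    c.properties.foreach { case (k, v) =>
      out.writeUTF(k)
      out.writeUTF(v)
    }
    out.flush()
    bytes.toByteArray
  }

  // Executor side: read the fields back in the same order.
  def decode(payload: Array[Byte]): Credential = {
    val in = new DataInputStream(new ByteArrayInputStream(payload))
    val expiresAt = in.readLong()
    val count = in.readInt()
    val props = (0 until count).map { _ =>
      val k = in.readUTF()
      (k, in.readUTF())
    }.toMap
    Credential(props, expiresAt)
  }
}
```

An opaque byte array keeps the RPC message independent of any provider-specific credential type, which is presumably why the proposal uses `Array[Byte]` rather than a typed field.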
*Q5. Who cares? If you are successful, what difference will it make?*
- Spark-on-Kubernetes operators running multi-tenant clusters
- EMR on EKS / Dataproc on GKE / AKS users needing per-workload access control
- Compliance-driven organizations requiring audit trails (CloudTrail)
- Users of S3-compatible on-premises storage (MinIO, Ceph) on Kubernetes
- Organizations choosing Trino over Spark due to Kerberos operational overhead
*Q6. What are the risks?*
- SPI shape may not fit all future requirements (mitigated: `@DeveloperApi`,
allowing evolution)
- Regression risk from modifying `CoarseGrainedSchedulerBackend` (mitigated:
gated behind `spark.security.oidc.enabled`, which defaults to `false`, and
changes are kept minimal)
- Token file race condition at driver startup (mitigated: retry with backoff)
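The "retry with backoff" mitigation for the token file race could look roughly like the sketch below. It is not taken from the SPIP: `TokenFileRetrySketch`, `readTokenWithBackoff`, the attempt count, and the delays are all hypothetical, and treating an empty file as "not ready yet" is an assumption.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, NoSuchFileException, Path}
import scala.annotation.tailrec

// Hypothetical sketch of reading a projected token file that may not exist yet.
object TokenFileRetrySketch {
  def readTokenWithBackoff(
      path: Path,
      maxAttempts: Int = 5,
      initialDelayMs: Long = 100L,
      sleep: Long => Unit = ms => Thread.sleep(ms)): String = {

    @tailrec
    def attempt(n: Int, delayMs: Long): String = {
      // A missing or empty file is treated as "not mounted yet" (assumption).
      val token: Option[String] =
        try {
          val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim
          if (text.nonEmpty) Some(text) else None
        } catch {
          case _: NoSuchFileException => None
        }

      token match {
        case Some(value) => value
        case None if n >= maxAttempts =>
          throw new IllegalStateException(s"Token file $path not readable after $n attempts")
        case None =>
          sleep(delayMs)
          attempt(n + 1, delayMs * 2) // double the wait before the next try
      }
    }

    attempt(1, initialDelayMs)
  }
}
```

The `sleep` parameter is injectable so that the backoff schedule can be checked in tests without real waiting.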
*Q7. How long will it take?*
- Core SPI (types, ingestor, manager, RPC): 4–6 weeks
- S3/STS reference provider: 2–3 weeks additional
- Total: 2–3 months from approval to merge (single release cycle)
*Q8. What are the mid-term and final "exams" to check for success?*
Mid-term: Core SPI merged. A local-mode job with a Fake CredentialProvider
observes UserCredentials delivered to executors and refreshed on token
rotation, with no regressions to Kerberos tests.
Final: Reference provider merged. A Spark job on Kubernetes accesses S3 (or
LocalStack) using OIDC-derived credentials with automatic refresh during a
long-running job.
–
*Appendix A: API Changes*
New `@DeveloperApi` types in `org.apache.spark.security`:
- `UserContext` (principal, issuer, rawToken, issuedAt, expiresAt)
- `ServiceCredential` (properties: Map[String, String], expiresAt)
- `CredentialProvider` trait (init, supportedSchemes, resolve)
New configuration keys: `spark.security.oidc.*`, `spark.kubernetes.oidc.*`
New RPC message: `UpdateUserCredentials(payload: Array[Byte])`
New field in `SparkAppConfig`: `userCredentials: Option[Array[Byte]]`
Backward compatible: everything is gated behind `spark.security.oidc.enabled`,
which defaults to `false`.
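To make the shape of the proposed types concrete, here is a minimal Scala sketch. Only the type, field, and method names come from the list above; every parameter type, return type, and the `FakeCredentialProvider` (with its `fake.ttl.seconds` key) are assumptions for illustration, and the real SPI may differ.

```scala
import java.time.Instant

// Hypothetical sketch: names follow the SPIP, signatures are assumed.
object CredentialSpiSketch {
  final case class UserContext(
      principal: String,
      issuer: String,
      rawToken: String,
      issuedAt: Instant,
      expiresAt: Instant)

  final case class ServiceCredential(
      properties: Map[String, String],
      expiresAt: Instant)

  trait CredentialProvider {
    def init(conf: Map[String, String]): Unit         // assumed parameter type
    def supportedSchemes: Set[String]                 // assumed: URI schemes such as "s3a"
    def resolve(user: UserContext): ServiceCredential // assumed signature
  }

  // A fake provider in the spirit of the mid-term exam: no network calls.
  final class FakeCredentialProvider extends CredentialProvider {
    private var ttlSeconds: Long = 900L

    override def init(conf: Map[String, String]): Unit =
      ttlSeconds = conf.get("fake.ttl.seconds").map(_.toLong).getOrElse(900L)

    override def supportedSchemes: Set[String] = Set("s3a")

    override def resolve(user: UserContext): ServiceCredential = {
      val wanted = user.issuedAt.plusSeconds(ttlSeconds)
      // Cap the credential lifetime at the lifetime of the source token.
      val expiry = if (wanted.isAfter(user.expiresAt)) user.expiresAt else wanted
      ServiceCredential(Map("fs.s3a.access.key" -> s"fake-${user.principal}"), expiry)
    }
  }
}
```

Keeping `properties` as a plain string map would let a provider hand back whatever settings its storage connector needs without core Spark knowing about them, which fits the vendor-neutrality goal stated in Appendix C.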
*Appendix B: Design Sketch*
See attached architecture diagram and sequence diagram.
The design mirrors the existing Kerberos credential propagation:
- `FileTokenIngestor` reads the projected SA token (analogous to keytab)
- `UserCredentialManager` orchestrates renewal (analogous to
`HadoopDelegationTokenManager`)
- `CredentialProvider.resolve()` obtains credentials (analogous to
`HadoopDelegationTokenProvider.obtainDelegationTokens()`)
- `UpdateUserCredentials` RPC distributes to executors (analogous to
`UpdateDelegationTokens`)
Reference provider (`connector/credential-aws`) calls STS
`AssumeRoleWithWebIdentity` and works with any STS-compatible endpoint (AWS,
MinIO, Ceph).
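The renewal orchestration attributed to `UserCredentialManager` is not detailed here, but one plausible scheduling rule is to refresh after a fixed fraction of the credential's remaining lifetime. The sketch below assumes that rule; `RenewalSketch` and `nextRefreshDelay` are hypothetical names, and the 0.75 default is borrowed from Spark's existing `spark.security.credentials.renewalRatio` setting rather than from the SPIP.

```scala
import java.time.{Duration, Instant}

// Hypothetical sketch of a refresh schedule; not the SPIP's actual logic.
object RenewalSketch {
  // Returns how long to wait before asking the provider for new credentials.
  def nextRefreshDelay(now: Instant, expiresAt: Instant, ratio: Double = 0.75): Duration = {
    val remainingMs = Duration.between(now, expiresAt).toMillis
    if (remainingMs <= 0) Duration.ZERO // already expired: refresh immediately
    else Duration.ofMillis((remainingMs * ratio).toLong)
  }
}
```

For example, credentials with 1000 seconds left would be refreshed after 750 seconds, leaving a margin for the new credentials to reach executors before the old ones expire.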
*Appendix C: Rejected Designs*
- Extend `HadoopDelegationTokenManager` for OIDC - rejected (Kerberos-centric,
UGI dependency)
- Use UGI to carry OIDC identity - rejected (tight Kerberos coupling)
- Ship AWS STS in core - rejected (vendor neutrality)
- Per-task UserContext from day one - rejected (not needed until Connect)