[
https://issues.apache.org/jira/browse/SPARK-57703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kousuke Saruta updated SPARK-57703:
-----------------------------------
Summary: SPIP: OIDC Credential Propagation (was: OIDC Credential
Propagation)
> SPIP: OIDC Credential Propagation
> ---------------------------------
>
> Key: SPARK-57703
> URL: https://issues.apache.org/jira/browse/SPARK-57703
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core
> Affects Versions: 4.3.0
> Reporter: Kousuke Saruta
> Priority: Major
> Labels: SPIP
>
> SPIP document:
> [https://docs.google.com/document/d/1usJKncCPMiyFUg7aIdpZ0HQsklXIHow_sU_6dfFMjN0/edit?tab=t.0#heading=h.fkw9v2um6wxa|https://docs.google.com/document/d/1usJKncCPMiyFUg7aIdpZ0HQsklXIHow_sU_6dfFMjN0/edit?usp=sharing]
> –
> *Q1. What are you trying to do?*
> Add a mechanism for Spark on Kubernetes to propagate OIDC-based credentials
> from the driver to executors, enabling per-workload (and eventually per-user)
> access control on cloud storage (S3, ADLS, GCS) and S3-compatible systems
> (MinIO, Ceph). This is the Kubernetes + OIDC equivalent of what Kerberos +
> delegation tokens provide on YARN + HDFS.
> *Q2. What problem is this proposal NOT designed to solve?*
> - Spark Connect authentication (separate concern, future work)
> - Azure / GCP provider implementations (future, same SPI)
> - Hive Metastore / catalog OIDC integration (future)
> - Modifying Hadoop UserGroupInformation
> - Per-task multi-user credential scoping
> *Q3. How is it done today, and what are the limits of current practice?*
> Today, all Spark jobs on Kubernetes access cloud storage as the pod's service
> account (IRSA / Pod Identity). There is no Spark-level mechanism to carry an
> OIDC token through the driver to executors and convert it into per-workload
> storage credentials. Workarounds are: one cluster per user (expensive),
> shared over-privileged service account (insecure), or custom credential
> provider implementations (unmaintainable).
> *Q4. What is new in your approach and why do you think it will be successful?*
> We introduce a `CredentialProvider` SPI that generalizes OAuth 2.0 Token
> Exchange (RFC 8693). The driver reads a projected ServiceAccount token (OIDC
> JWT), calls the SPI to obtain temporary storage credentials, and distributes
> them to executors via a new `UpdateUserCredentials` RPC — mirroring the
> existing `HadoopDelegationTokenManager` / `UpdateDelegationTokens` pattern.
> This approach succeeds because it follows an established Spark pattern,
> coexists with Kerberos, and requires no changes to user application code.
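
The driver-to-executor flow in Q4 can be sketched roughly as follows. Only the message name `UpdateUserCredentials` and its `payload: Array[Byte]` field come from the proposal; `FakeExecutor`, `propagate`, the `exchange` function, and the length-prefixed wire format are hypothetical stand-ins chosen for illustration, not the SPIP's actual design.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object CredentialFlowSketch {
  // Name and field taken from the proposal; everything else here is assumed.
  final case class UpdateUserCredentials(payload: Array[Byte])

  // Hypothetical executor endpoint that keeps the most recent credentials.
  final class FakeExecutor {
    @volatile var credentials: Map[String, String] = Map.empty
    def receive(msg: UpdateUserCredentials): Unit = {
      credentials = deserialize(msg.payload)
    }
  }

  // Assumed wire format: entry count, then UTF-encoded key/value pairs.
  // writeUTF caps each string at 65535 encoded bytes, which is enough for a sketch.
  def serialize(props: Map[String, String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(props.size)
    props.foreach { case (k, v) => out.writeUTF(k); out.writeUTF(v) }
    out.flush()
    bytes.toByteArray
  }

  def deserialize(payload: Array[Byte]): Map[String, String] = {
    val in = new DataInputStream(new ByteArrayInputStream(payload))
    val n = in.readInt()
    (0 until n).map(_ => in.readUTF() -> in.readUTF()).toMap
  }

  // Driver side: exchange the raw OIDC token for storage credentials,
  // then send the same serialized payload to every executor.
  def propagate(
      rawToken: String,
      exchange: String => Map[String, String],
      executors: Seq[FakeExecutor]): Unit = {
    val msg = UpdateUserCredentials(serialize(exchange(rawToken)))
    executors.foreach(_.receive(msg))
  }
}
```

In a real implementation the `exchange` step would be the `CredentialProvider` SPI call and delivery would go through Spark's RPC layer rather than a direct method call.
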
> *Q5. Who cares? If you are successful, what difference will it make?*
> - Spark-on-Kubernetes operators running multi-tenant clusters
> - EMR on EKS / Dataproc on GKE / AKS users needing per-workload access
> control
> - Compliance-driven organizations requiring audit trails (CloudTrail)
> - Users of S3-compatible on-premises storage (MinIO, Ceph) on Kubernetes
> - Organizations choosing Trino over Spark due to Kerberos operational
> overhead
> *Q6. What are the risks?*
> - SPI shape may not fit all future requirements (mitigated: `@DeveloperApi`,
> allowing evolution)
> - Regression risk from modifying `CoarseGrainedSchedulerBackend` (mitigated:
> gated by `spark.security.oidc.enabled`, which defaults to `false`; changes kept minimal)
> - Token file race condition at driver startup (mitigated: retry with backoff)
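
The "retry with backoff" mitigation for the token file race might look like the sketch below. The function name, attempt count, and delays are assumptions for illustration; the proposal does not specify them.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, NoSuchFileException, Path}
import scala.annotation.tailrec

object TokenFileRetrySketch {
  /**
   * Reads a non-empty token from `path`, retrying with exponential backoff
   * while the file is missing or still empty. All defaults are hypothetical.
   */
  def readTokenWithBackoff(
      path: Path,
      maxAttempts: Int = 5,
      initialDelayMs: Long = 100L,
      sleep: Long => Unit = ms => Thread.sleep(ms)): String = {
    require(maxAttempts >= 1, "maxAttempts must be at least 1")

    @tailrec
    def attempt(n: Int, delayMs: Long): String = {
      val token: Option[String] =
        try {
          val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim
          Some(text).filter(_.nonEmpty)
        } catch {
          case _: NoSuchFileException => None // file not projected yet
        }
      token match {
        case Some(t) => t
        case None if n >= maxAttempts =>
          throw new IllegalStateException(s"No token at $path after $maxAttempts attempts")
        case None =>
          sleep(delayMs)
          attempt(n + 1, delayMs * 2) // double the delay on each retry
      }
    }

    attempt(1, initialDelayMs)
  }
}
```

The `sleep` parameter is injectable so that tests can observe the backoff sequence without actually waiting.
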
> *Q7. How long will it take?*
> - Core SPI (types, ingestor, manager, RPC): 4–6 weeks
> - S3/STS reference provider: 2–3 weeks additional
> - Total: 2–3 months from approval to merge (single release cycle)
> *Q8. What are the mid-term and final "exams" to check for success?*
> Mid-term: Core SPI merged. A local-mode job with a Fake CredentialProvider
> observes UserCredentials delivered to executors and refreshed on token
> rotation, with no regressions to Kerberos tests.
> Final: Reference provider merged. A Spark job on Kubernetes accesses S3 (or
> LocalStack) using OIDC-derived credentials with automatic refresh during a
> long-running job.
> –
> *Appendix A: API Changes*
> New `@DeveloperApi` types in `org.apache.spark.security`:
> - `UserContext` (principal, issuer, rawToken, issuedAt, expiresAt)
> - `ServiceCredential` (properties: Map[String, String], expiresAt)
> - `CredentialProvider` trait (init, supportedSchemes, resolve)
> New configuration keys: `spark.security.oidc.*`, `spark.kubernetes.oidc.*`
> New RPC message: `UpdateUserCredentials(payload: Array[Byte])`
> New field in `SparkAppConfig`: `userCredentials: Option[Array[Byte]]`
> Backward compatible: everything is gated by `spark.security.oidc.enabled`,
> which defaults to `false`.
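
One possible Scala shape for the types listed in Appendix A is sketched below. The type, field, and method names come from the list above; the field types (`Instant`, `String`), the method signatures, and the `FakeCredentialProvider` with its `fake.*` keys are assumptions made for illustration. The real definitions live in the SPIP document.

```scala
import java.time.Instant

object CredentialSpiSketch {
  // Field names from the proposal; field types are assumed.
  final case class UserContext(
      principal: String,
      issuer: String,
      rawToken: String,
      issuedAt: Instant,
      expiresAt: Instant)

  final case class ServiceCredential(properties: Map[String, String], expiresAt: Instant)

  // Method names from the proposal; parameter and return types are assumed.
  trait CredentialProvider {
    def init(conf: Map[String, String]): Unit
    def supportedSchemes: Set[String]
    def resolve(user: UserContext, scheme: String): Option[ServiceCredential]
  }

  // Hypothetical fake provider, in the spirit of the Q8 mid-term exam.
  final class FakeCredentialProvider extends CredentialProvider {
    private var prefix: String = "fake"

    override def init(conf: Map[String, String]): Unit = {
      prefix = conf.getOrElse("fake.prefix", "fake")
    }

    override def supportedSchemes: Set[String] = Set("s3a")

    // Returns a deterministic credential for supported schemes, None otherwise.
    override def resolve(user: UserContext, scheme: String): Option[ServiceCredential] =
      if (supportedSchemes.contains(scheme)) {
        Some(ServiceCredential(
          Map("fake.access.key" -> s"$prefix-${user.principal}"),
          user.expiresAt))
      } else {
        None
      }
  }
}
```
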
> *Appendix B: Design Sketch*
> See attached architecture diagram and sequence diagram.
> The design mirrors the existing Kerberos credential propagation:
> - `FileTokenIngestor` reads the projected SA token (analogous to keytab)
> - `UserCredentialManager` orchestrates renewal (analogous to
> `HadoopDelegationTokenManager`)
> - `CredentialProvider.resolve()` obtains credentials (analogous to
> `HadoopDelegationTokenProvider.obtainDelegationTokens()`)
> - `UpdateUserCredentials` RPC distributes to executors (analogous to
> `UpdateDelegationTokens`)
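
How `UserCredentialManager` decides when to renew is not spelled out here. One plausible policy, shown as a sketch, is to renew after a fixed fraction of the credential's remaining lifetime, similar in spirit to the ratio-based renewal used for delegation tokens. The function name and the 0.75 default are assumptions for this example.

```scala
import java.time.{Duration, Instant}

object RenewalScheduleSketch {
  /**
   * Delay before the next renewal: `ratio` of the time left until expiry,
   * or zero if the credential has already expired. Hypothetical policy.
   */
  def nextRenewalDelay(now: Instant, expiresAt: Instant, ratio: Double = 0.75): Duration = {
    require(ratio > 0.0 && ratio <= 1.0, "ratio must be in (0, 1]")
    val remainingMs = Duration.between(now, expiresAt).toMillis
    if (remainingMs <= 0) Duration.ZERO
    else Duration.ofMillis((remainingMs * ratio).toLong)
  }
}
```

With a one-hour credential and the assumed ratio, the next renewal would be scheduled 45 minutes out, leaving a margin for retries before expiry.
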
> Reference provider (`connector/credential-aws`) calls STS
> `AssumeRoleWithWebIdentity` and works with any STS-compatible endpoint (AWS,
> MinIO, Ceph).
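
For orientation, the form-encoded body of an STS `AssumeRoleWithWebIdentity` request can be assembled as in the sketch below. The parameter names and the `2011-06-15` API version follow the AWS STS Query API; the object name, function name, and default duration are hypothetical, and the HTTP call, endpoint selection, and response parsing are deliberately left out. How the reference provider actually issues the call is not described in this message.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object StsRequestSketch {
  private def enc(s: String): String = URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  /**
   * Builds the application/x-www-form-urlencoded body for
   * AssumeRoleWithWebIdentity. The caller would POST this to an
   * STS-compatible endpoint (not done here).
   */
  def assumeRoleWithWebIdentityBody(
      roleArn: String,
      sessionName: String,
      webIdentityToken: String,
      durationSeconds: Int = 3600): String = {
    val params = Seq(
      "Action" -> "AssumeRoleWithWebIdentity",
      "Version" -> "2011-06-15",
      "RoleArn" -> roleArn,
      "RoleSessionName" -> sessionName,
      "WebIdentityToken" -> webIdentityToken,
      "DurationSeconds" -> durationSeconds.toString)
    params.map { case (k, v) => s"$k=${enc(v)}" }.mkString("&")
  }
}
```

The web identity token itself authenticates this operation, so the request is sent without AWS request signing, which is what lets the driver bootstrap from nothing but the projected ServiceAccount token.
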
> *Appendix C: Rejected Designs*
> - Extend `HadoopDelegationTokenManager` for OIDC - rejected
> (Kerberos-centric, UGI dependency)
> - Use UGI to carry OIDC identity - rejected (tight Kerberos coupling)
> - Ship AWS STS in core - rejected (vendor neutrality)
> - Per-task UserContext from day one - rejected (not needed until Connect)