Kousuke Saruta created SPARK-57703:
--------------------------------------

             Summary: OIDC Credential Propagation
                 Key: SPARK-57703
                 URL: https://issues.apache.org/jira/browse/SPARK-57703
             Project: Spark
          Issue Type: Umbrella
          Components: Spark Core
    Affects Versions: 4.3.0
            Reporter: Kousuke Saruta


SPIP docs: 
[https://docs.google.com/document/d/1usJKncCPMiyFUg7aIdpZ0HQsklXIHow_sU_6dfFMjN0/edit?tab=t.0#heading=h.fkw9v2um6wxa|https://docs.google.com/document/d/1usJKncCPMiyFUg7aIdpZ0HQsklXIHow_sU_6dfFMjN0/edit?usp=sharing]

–

*Q1. What are you trying to do?*

Add a mechanism for Spark on Kubernetes to propagate OIDC-based credentials 
from the driver to executors, enabling per-workload (and eventually per-user) 
access control on cloud storage (S3, ADLS, GCS) and S3-compatible systems 
(MinIO, Ceph). This is the Kubernetes + OIDC equivalent of what Kerberos + 
delegation tokens provide on YARN + HDFS.

*Q2. What problem is this proposal NOT designed to solve?*
 - Spark Connect authentication (separate concern, future work)
 - Azure / GCP provider implementations (future, same SPI)
 - Hive Metastore / catalog OIDC integration (future)
 - Modifying Hadoop UserGroupInformation
 - Per-task multi-user credential scoping

*Q3. How is it done today, and what are the limits of current practice?*

Today, all Spark jobs on Kubernetes access cloud storage as the pod's service 
account (IRSA / Pod Identity). There is no Spark-level mechanism to carry an 
OIDC token through the driver to executors and convert it into per-workload 
storage credentials. The workarounds are one cluster per user (expensive), a 
shared over-privileged service account (insecure), or custom credential 
provider implementations (unmaintainable).

*Q4. What is new in your approach and why do you think it will be successful?*

We introduce a `CredentialProvider` SPI that generalizes OAuth 2.0 Token 
Exchange (RFC 8693). The driver reads a projected ServiceAccount token (OIDC 
JWT), calls the SPI to obtain temporary storage credentials, and distributes 
them to executors via a new `UpdateUserCredentials` RPC, mirroring the 
existing `HadoopDelegationTokenManager` / `UpdateDelegationTokens` pattern. 
We expect this approach to succeed because it follows an established Spark 
pattern, coexists with Kerberos, and requires no changes to user application 
code.
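
As a rough illustration of the first step (reading the OIDC JWT), the sketch below decodes the payload segment of a token and pulls out a few claims. `JwtClaims` and its methods are hypothetical names, and the claim names (`iss`, `sub`, `exp`) are the standard JWT ones rather than anything this proposal specifies. The regex-based extraction is a stand-in for a real JSON parser, and the signature is not verified.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64
import java.util.regex.Pattern

// Hypothetical helper, not part of the proposal. It only decodes the token;
// it does NOT verify the signature.
object JwtClaims {
  // Decode the middle (payload) segment of a compact JWT into its JSON text.
  def payloadJson(jwt: String): String = {
    val parts = jwt.split('.')
    require(parts.length == 3, "expected header.payload.signature")
    new String(Base64.getUrlDecoder.decode(parts(1)), StandardCharsets.UTF_8)
  }

  // Naive lookup of a top-level string claim (assumption: flat, simple JSON).
  def stringClaim(json: String, name: String): Option[String] = {
    val pattern = ("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"").r
    pattern.findFirstMatchIn(json).map(_.group(1))
  }

  // Naive lookup of a numeric claim such as `exp` or `iat` (epoch seconds).
  def longClaim(json: String, name: String): Option[Long] = {
    val pattern = ("\"" + Pattern.quote(name) + "\"\\s*:\\s*(\\d+)").r
    pattern.findFirstMatchIn(json).map(_.group(1).toLong)
  }
}
```

In this reading, `sub`, `iss`, `iat` and `exp` would plausibly feed the principal, issuer, issuedAt and expiresAt fields of a user context, but that mapping is an assumption.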

*Q5. Who cares? If you are successful, what difference will it make?*
 - Spark-on-Kubernetes operators running multi-tenant clusters
 - EMR on EKS / Dataproc on GKE / AKS users needing per-workload access control
 - Compliance-driven organizations requiring audit trails (CloudTrail)
 - Users of S3-compatible on-premises storage (MinIO, Ceph) on Kubernetes
 - Organizations choosing Trino over Spark due to Kerberos operational overhead

*Q6. What are the risks?*
 - SPI shape may not fit all future requirements (mitigated: `@DeveloperApi`, 
allowing evolution)
 - Regression risk from modifying `CoarseGrainedSchedulerBackend` (mitigated: 
gated by `spark.security.oidc.enabled`, which defaults to `false`; minimal 
changes)
 - Token file race condition at driver startup (mitigated: retry with backoff)
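
The "retry with backoff" mitigation for the token file race could look roughly like the sketch below. `TokenFileReader`, its parameters, and the doubling schedule capped at `maxDelayMs` are all assumptions made for illustration; the proposal does not specify the retry policy.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper: waits for the projected token file to appear and be
// non-empty, doubling the delay between attempts up to maxDelayMs.
object TokenFileReader {
  @tailrec
  def readWithBackoff(
      path: Path,
      attemptsLeft: Int,
      delayMs: Long,
      maxDelayMs: Long = 5000L): String = {
    // A missing file and an empty file are both treated as "not ready yet".
    val attempt = Try(new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim)
      .filter(_.nonEmpty)
    attempt match {
      case Success(token) => token
      case Failure(e) if attemptsLeft <= 1 =>
        throw new IllegalStateException(s"token file $path not readable", e)
      case Failure(_) =>
        Thread.sleep(delayMs)
        readWithBackoff(path, attemptsLeft - 1, math.min(delayMs * 2, maxDelayMs), maxDelayMs)
    }
  }
}
```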

*Q7. How long will it take?*
 - Core SPI (types, ingestor, manager, RPC): 4–6 weeks
 - S3/STS reference provider: 2–3 weeks additional
 - Total: 2–3 months from approval to merge (single release cycle)

*Q8. What are the mid-term and final "exams" to check for success?*

Mid-term: Core SPI merged. A local-mode job with a Fake CredentialProvider 
observes UserCredentials delivered to executors and refreshed on token 
rotation, with no regressions to Kerberos tests.

Final: Reference provider merged. A Spark job on Kubernetes accesses S3 (or 
LocalStack) using OIDC-derived credentials with automatic refresh during a 
long-running job.

–

*Appendix A: API Changes*

New `@DeveloperApi` types in `org.apache.spark.security`:
 - `UserContext` (principal, issuer, rawToken, issuedAt, expiresAt)
 - `ServiceCredential` (properties: Map[String, String], expiresAt)
 - `CredentialProvider` trait (init, supportedSchemes, resolve)
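
One possible shape for these types is sketched below. The type, field, and method names come from the list above; the parameter types, the `Map[String, String]` argument to `init`, the property keys, and `FakeCredentialProvider` itself are guesses for illustration only.

```scala
import java.time.Instant

// Names follow Appendix A; the exact signatures are assumptions.
final case class UserContext(
    principal: String,
    issuer: String,
    rawToken: String,
    issuedAt: Instant,
    expiresAt: Instant)

final case class ServiceCredential(properties: Map[String, String], expiresAt: Instant)

trait CredentialProvider {
  def init(conf: Map[String, String]): Unit
  def supportedSchemes: Set[String]
  def resolve(user: UserContext): ServiceCredential
}

// Hypothetical in-memory provider of the kind a local-mode test might use.
final class FakeCredentialProvider extends CredentialProvider {
  private var ttlSeconds: Long = 900L

  override def init(conf: Map[String, String]): Unit =
    ttlSeconds = conf.get("fake.ttl.seconds").map(_.toLong).getOrElse(900L)

  override def supportedSchemes: Set[String] = Set("s3a")

  // Issue a credential that never outlives the token it was derived from.
  override def resolve(user: UserContext): ServiceCredential = {
    val cap = user.issuedAt.plusSeconds(ttlSeconds)
    val expiry = if (cap.isBefore(user.expiresAt)) cap else user.expiresAt
    ServiceCredential(
      Map("fake.principal" -> user.principal, "fake.issuer" -> user.issuer),
      expiry)
  }
}
```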

New configuration keys: `spark.security.oidc.*`, `spark.kubernetes.oidc.*`

New RPC message: `UpdateUserCredentials(payload: Array[Byte])`

New field in `SparkAppConfig`: `userCredentials: Option[Array[Byte]]`

Backward compatible: all changes are gated by `spark.security.oidc.enabled`, which defaults to `false`.
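
Both `UpdateUserCredentials` and `SparkAppConfig.userCredentials` carry an opaque `Array[Byte]`. The sketch below shows one way such a payload could be encoded and decoded. The wire format (expiry as epoch milliseconds, then a count, then key/value pairs) and the name `CredentialPayload` are assumptions; the proposal does not define the serialization.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical codec for the credential payload; the layout is an assumption.
object CredentialPayload {
  def encode(properties: Map[String, String], expiresAtEpochMs: Long): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeLong(expiresAtEpochMs)
    out.writeInt(properties.size)
    // Sort keys so the same input always produces the same bytes.
    properties.toSeq.sortBy(_._1).foreach { case (k, v) =>
      out.writeUTF(k) // writeUTF limits each string to 65535 encoded bytes
      out.writeUTF(v)
    }
    out.flush()
    bytes.toByteArray
  }

  def decode(payload: Array[Byte]): (Map[String, String], Long) = {
    val in = new DataInputStream(new ByteArrayInputStream(payload))
    val expiresAtEpochMs = in.readLong()
    val count = in.readInt()
    val properties = (0 until count).map { _ =>
      val k = in.readUTF()
      val v = in.readUTF()
      k -> v
    }.toMap
    (properties, expiresAtEpochMs)
  }
}
```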

*Appendix B: Design Sketch*

See attached architecture diagram and sequence diagram.

The design mirrors the existing Kerberos credential propagation:
 - `FileTokenIngestor` reads the projected SA token (analogous to keytab)
 - `UserCredentialManager` orchestrates renewal (analogous to 
`HadoopDelegationTokenManager`)
 - `CredentialProvider.resolve()` obtains credentials (analogous to 
`HadoopDelegationTokenProvider.obtainDelegationTokens()`)
 - `UpdateUserCredentials` RPC distributes to executors (analogous to 
`UpdateDelegationTokens`)
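
To make the renewal step more concrete, here is a small sketch of how a manager might decide when to refresh next: take the earliest expiry among the token and the derived credentials, and refresh after a fraction of the remaining lifetime. `RenewalSchedule`, the function signature, and the 0.75 ratio are assumptions, not part of the proposal.

```scala
// Hypothetical helper for picking the next refresh time.
object RenewalSchedule {
  // Returns the delay in milliseconds before the next refresh attempt.
  // An already-expired entry yields 0, meaning "refresh immediately".
  def nextRefreshDelayMs(nowMs: Long, expiriesMs: Seq[Long], ratio: Double = 0.75): Long = {
    require(expiriesMs.nonEmpty, "at least one expiry is required")
    require(ratio > 0.0 && ratio <= 1.0, "ratio must be in (0, 1]")
    val remainingMs = math.max(0L, expiriesMs.min - nowMs)
    (remainingMs * ratio).toLong
  }
}
```

Refreshing before the full lifetime has elapsed leaves headroom for a failed attempt to be retried while executors still hold valid credentials.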

Reference provider (`connector/credential-aws`) calls STS 
`AssumeRoleWithWebIdentity` and works with any STS-compatible endpoint (AWS, 
MinIO, Ceph).
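
For orientation, the sketch below builds the form-encoded body of an `AssumeRoleWithWebIdentity` request. The parameter names and the `2011-06-15` version string follow the AWS STS query API; `StsRequest`, the default duration, and the omission of the endpoint URL and HTTP transport are choices made for this illustration, and the actual reference provider may well use an SDK instead.

```scala
import java.net.URLEncoder

// Hypothetical helper: builds the request body only; sending it to an
// STS-compatible endpoint is left out.
object StsRequest {
  def formBody(
      roleArn: String,
      sessionName: String,
      webIdentityToken: String,
      durationSeconds: Int = 3600): String = {
    val params = Seq(
      "Action" -> "AssumeRoleWithWebIdentity",
      "Version" -> "2011-06-15",
      "RoleArn" -> roleArn,
      "RoleSessionName" -> sessionName,
      "WebIdentityToken" -> webIdentityToken,
      "DurationSeconds" -> durationSeconds.toString
    )
    // Percent-encode each value for application/x-www-form-urlencoded.
    params.map { case (k, v) => s"$k=${URLEncoder.encode(v, "UTF-8")}" }.mkString("&")
  }
}
```

The temporary access key, secret key, and session token in the response would then be the natural candidates for the `properties` map of a `ServiceCredential`.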

*Appendix C: Rejected Designs*
 - Extend `HadoopDelegationTokenManager` for OIDC - rejected (Kerberos-centric, 
UGI dependency)
 - Use UGI to carry OIDC identity - rejected (tight Kerberos coupling)
 - Ship AWS STS in core - rejected (vendor neutrality)
 - Per-task UserContext from day one - rejected (not needed until Connect)


