[ 
https://issues.apache.org/jira/browse/SPARK-57425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088847#comment-18088847
 ] 

Daisuke Taniwaki edited comment on SPARK-57425 at 6/14/26 6:15 AM:
-------------------------------------------------------------------

I have a working fix for this issue that I am about to push upstream, and I’d 
appreciate your contributions to, or review of, the PR. 

[https://github.com/dtaniwaki/spark/tree/SPARK-57425-connect-reattach-metadata-refresh]

I will create a PR soon. 


was (Author: JIRAUSER313670):
I have a local working fix for this issue on the way to push to the upstream, 
but I’d appreciate your contribution or review of the PR. 

> Reattach iterator cannot recover when short-TTL credentials expire mid-stream
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-57425
>                 URL: https://issues.apache.org/jira/browse/SPARK-57425
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 4.1.0, 4.0.0, 4.2.0, 5.0.0
>            Reporter: Daisuke Taniwaki
>            Priority: Major
>              Labels: pull-request-available
>
> `ExecutePlanResponseReattachableIterator`
> (`python/pyspark/sql/connect/client/reattach.py`) has a reattach mechanism
> designed to recover when the underlying gRPC stream is broken before
> `ResultComplete`. That recovery is structurally impossible when the
> server enforces a short auth-token TTL (e.g. AWS Athena Spark, 30 min):
> 1. `ExecutePlan` is started with a fresh credential.
> 2. The query runs past the TTL; the server kills the stream with
>    `PERMISSION_DENIED`.
> 3. The default retry policy does not treat `PERMISSION_DENIED` as
>    retryable, so the iterator never even attempts to reattach.
> 4. Even if reattach were attempted, `self._metadata` still holds the
>    expired token captured at `__init__`, so it would immediately fail
>    with the same `PERMISSION_DENIED` error.
>
> The iterator's own contract ("recover from broken stream") is violated
> for any deployment that combines short token TTLs with long-running
> streams. Both gaps must be fixed for the reattach machinery to do what
> it was designed to do.
>
> This has not surfaced in typical deployments because four conditions
> must align: a short server TTL, a stream that outlives it, a server that
> actively kills the stream on expiry, and reattach firing. Local dev
> without auth, on-prem with long-lived tokens, and short ad-hoc queries
> each fail to meet at least one of them. Managed federated-credential
> environments hit all four; Athena Spark Connect with its 30-minute auth
> token is the canonical trigger.
>
> The dbt-athena Spark adapter ships runtime monkey-patches today as a
> verified workaround. They have been in production use long enough to
> confirm the behaviour is safe. The fix here folds the moving parts into
> upstream so the workaround becomes unnecessary.
>
> Backport requested to branch-4.0, branch-4.1, and branch-4.2, because
> 4.x is what managed environments actually run.
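
The two gaps described in the quoted issue can be illustrated with a small,
self-contained sketch. It does not use PySpark or gRPC and is not the code in
the branch linked above: `Status`, `StreamError`, `FakeServer`,
`ReattachingIterator`, `run`, and both retry predicates are hypothetical names
invented for this illustration. The assumption is that a fix needs (a) a retry
predicate that accepts `PERMISSION_DENIED` and (b) metadata that is re-read on
every reattach instead of being captured once.

```python
import enum
from typing import Callable, Iterator, List, Tuple

Metadata = List[Tuple[str, str]]


class Status(enum.Enum):
    """Stand-in for the two gRPC status codes relevant to this sketch."""
    UNAVAILABLE = "UNAVAILABLE"
    PERMISSION_DENIED = "PERMISSION_DENIED"


class StreamError(Exception):
    """Hypothetical stand-in for an RPC error carrying a status code."""
    def __init__(self, status: Status) -> None:
        super().__init__(status.value)
        self.status = status


class FakeServer:
    """Toy server that rejects a stream once its token is no longer valid."""
    def __init__(self, items: List[str], valid_token: str) -> None:
        self.items = list(items)
        self.valid_token = valid_token

    def stream(self, metadata: Metadata, start: int) -> Iterator[Tuple[int, str]]:
        token = dict(metadata).get("authorization")
        for i in range(start, len(self.items)):
            # Checked before every item to mimic expiry in mid-stream.
            if token != self.valid_token:
                raise StreamError(Status.PERMISSION_DENIED)
            yield i, self.items[i]


class ReattachingIterator:
    """Resumes a broken stream from the last delivered index."""
    def __init__(self, server: FakeServer,
                 metadata_provider: Callable[[], Metadata],
                 retryable: Callable[[StreamError], bool],
                 max_attempts: int = 3) -> None:
        self._server = server
        self._metadata_provider = metadata_provider
        self._retryable = retryable
        self._max_attempts = max_attempts
        self._next_index = 0

    def __iter__(self) -> Iterator[str]:
        attempts = 0
        while True:
            # Gap 2: ask for metadata on every (re)attach, not once at __init__.
            metadata = self._metadata_provider()
            try:
                for index, item in self._server.stream(metadata, self._next_index):
                    self._next_index = index + 1
                    attempts = 0
                    yield item
                return
            except StreamError as err:
                attempts += 1
                # Gap 1: the predicate decides whether the error is retryable.
                if not self._retryable(err) or attempts >= self._max_attempts:
                    raise


def default_retry(err: StreamError) -> bool:
    return err.status is Status.UNAVAILABLE


def retry_including_auth(err: StreamError) -> bool:
    return err.status in (Status.UNAVAILABLE, Status.PERMISSION_DENIED)


def run(retryable: Callable[[StreamError], bool], refresh: bool) -> List[str]:
    """Stream four items while the token rotates after the second one."""
    current = {"token": "token-1"}
    server = FakeServer(["a", "b", "c", "d"], valid_token="token-1")
    captured = [("authorization", "token-1")]  # metadata frozen at start
    if refresh:
        provider = lambda: [("authorization", current["token"])]
    else:
        provider = lambda: captured
    received = []
    for item in ReattachingIterator(server, provider, retryable):
        received.append(item)
        if item == "b":
            current["token"] = server.valid_token = "token-2"
    return received


print(run(retry_including_auth, refresh=True))  # ['a', 'b', 'c', 'd']
```

In this sketch, `run(default_retry, refresh=True)` raises on the first
`PERMISSION_DENIED` without reattaching, and
`run(retry_including_auth, refresh=False)` reattaches with the stale token
until the attempts are exhausted, which mirrors steps 3 and 4 of the issue.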



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
