[ 
https://issues.apache.org/jira/browse/SPARK-57657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-57657:
---------------------------------
    Description: 
{{ClientStreamingQuerySuite."listener events"}} is flaky (e.g. in the Java 21 
connect job): the QueryStarted and QueryProgress events arrive, but the 
terminal QueryTerminatedEvent is never received, even though the query has 
stopped.

Cause: {{SparkConnectListenerBusListener.send}} removes the server-side 
listener and stops sending all further events on the first {{onNext}} 
exception, so a single transient failure on a frequent progress event silently 
drops the later terminate event.

Fix: retry {{onNext}} a small, bounded number of times before tearing the 
listener down. Cleanup still happens when the client is genuinely unresponsive.

Test diagnostics are also added to help debug any future recurrence in 
scheduled jobs, since the Connect server logs are not captured in CI.
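
The bounded-retry idea described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the actual Spark Connect change: {{ListenerBridge}}, {{FlakyClient}}, {{send_with_retry}}, and the default of 3 attempts are all hypothetical names and values chosen for the example (the description only says "a small bounded number of times").

```python
from typing import Callable, List


def send_with_retry(on_next: Callable[[str], None], event: str,
                    max_attempts: int = 3) -> bool:
    """Try to deliver one event; return True on success, False once
    max_attempts consecutive attempts have failed."""
    for _ in range(max_attempts):
        try:
            on_next(event)
            return True
        except Exception:
            # Assumed transient; try again until the attempt budget runs out.
            continue
    return False


class ListenerBridge:
    """Hypothetical stand-in for the server-side listener that forwards
    streaming query events to a client stream."""

    def __init__(self, on_next: Callable[[str], None], max_attempts: int = 3):
        self.on_next = on_next
        self.max_attempts = max_attempts
        self.active = True

    def send(self, event: str) -> None:
        if not self.active:
            return  # listener was already torn down
        if not send_with_retry(self.on_next, event, self.max_attempts):
            # Only give up (and stop sending) after repeated failures,
            # i.e. when the client looks genuinely unresponsive.
            self.active = False


class FlakyClient:
    """Hypothetical client whose on_next fails on selected call numbers."""

    def __init__(self, failing_calls):
        self.failing_calls = set(failing_calls)
        self.calls = 0
        self.received: List[str] = []

    def on_next(self, event: str) -> None:
        self.calls += 1
        if self.calls in self.failing_calls:
            raise ConnectionError("simulated transient send failure")
        self.received.append(event)


def demo() -> List[str]:
    # The second call (the progress event) fails once, then succeeds on retry.
    client = FlakyClient(failing_calls=[2])
    bridge = ListenerBridge(client.on_next)
    for event in ["started", "progress", "terminated"]:
        bridge.send(event)
    return client.received
```

In this sketch a single failed send of the progress event no longer prevents the terminate event from being delivered, while a client that fails every attempt still causes the bridge to deactivate after the attempt budget is spent.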

> Spark Connect should not drop streaming listener events on a transient send 
> failure
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-57657
>                 URL: https://issues.apache.org/jira/browse/SPARK-57657
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 5.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Major


