HyukjinKwon opened a new pull request, #56729:
URL: https://github.com/apache/spark/pull/56729

   ### What changes were proposed in this pull request?
   
   In `SparkConnectListenerBusListener.send`, retry `responseObserver.onNext` a small bounded number of
   times before tearing the listener down, instead of removing it on the first exception. Also add
   diagnostic context to the `ClientStreamingQuerySuite."listener events"` assertions and make the test
   listener's fields `@volatile`.
   
   ### Why are the changes needed?
   
   `ClientStreamingQuerySuite."listener events"` is flaky: the `QueryStarted`/`QueryProgress` events
   arrive but the terminal `QueryTerminatedEvent` is never received even though the query has stopped.
   The server-side listener removes itself and stops sending **all** further events on the first
   `onNext` failure, so a single transient gRPC hiccup on a frequent progress event silently drops the
   later terminate event. A bounded retry keeps the listener alive across transient failures while still
   cleaning up when the client is genuinely unresponsive.
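
The bounded-retry idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Observer` is a hypothetical stand-in for gRPC's `StreamObserver`, and `RetryingSender`, `maxAttempts`, `onGiveUp`, and `FlakyObserver` are names invented for the example. The actual retry count and teardown logic live in `SparkConnectListenerBusListener.send`.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

// Hypothetical stand-in for gRPC's StreamObserver; only onNext matters here.
trait Observer[T] {
  def onNext(value: T): Unit
}

// Sketch: try onNext up to maxAttempts times, and only invoke the teardown
// callback (onGiveUp) once every attempt for this event has failed.
final class RetryingSender[T](
    observer: Observer[T],
    maxAttempts: Int,
    onGiveUp: Throwable => Unit) {
  require(maxAttempts >= 1, "maxAttempts must be at least 1")

  /** Returns true if the event was delivered, false if the sender gave up. */
  def send(event: T): Boolean = {
    var attempt = 0
    var delivered = false
    var lastError: Option[Throwable] = None
    while (!delivered && attempt < maxAttempts) {
      attempt += 1
      try {
        observer.onNext(event)
        delivered = true
      } catch {
        case NonFatal(e) => lastError = Some(e) // remember the failure and retry
      }
    }
    if (!delivered) lastError.foreach(onGiveUp)
    delivered
  }
}

// Demo helper: throws on its first `failures` calls, then records events.
final class FlakyObserver(failures: Int) extends Observer[String] {
  var calls = 0
  val received = ArrayBuffer.empty[String]
  override def onNext(value: String): Unit = {
    calls += 1
    if (calls <= failures) throw new RuntimeException(s"transient failure #$calls")
    received += value
  }
}
```

With `maxAttempts = 3`, an observer that fails twice still receives the event on the third attempt and the listener stays registered. An observer that never recovers triggers `onGiveUp` after exactly three calls, which is where the real listener would remove itself.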
   
   The connect server runs in a separate process whose logs are not captured in CI, so the exact failure
   is inferred; the added assertion diagnostics (`diag(stage)`) surface the client-side state if this
   test ever flakes again in a scheduled job, to confirm/refine the root cause.
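
As a rough illustration of the test-side changes, the sketch below shows a listener whose fields are `@volatile` and a `diag(stage)` helper that renders its state for an assertion message. Everything except the name `diag(stage)` and the `@volatile` annotation is assumed for the example: `RecordingListener`, its fields, and the message format are hypothetical and may not match the actual test code.

```scala
// Hypothetical test listener. Fields are written by the event-delivery thread
// and read by the test thread, so they are @volatile for visibility.
final class RecordingListener {
  @volatile var started: Option[String] = None
  @volatile var progressCount: Int = 0
  @volatile var terminated: Option[String] = None

  def onStarted(id: String): Unit = started = Some(id)
  // Not atomic, but adequate here because only one thread writes.
  def onProgress(): Unit = progressCount += 1
  def onTerminated(id: String): Unit = terminated = Some(id)

  // Renders the client-side state so a failing assertion says what was seen.
  def diag(stage: String): String =
    s"[$stage] started=$started progressCount=$progressCount terminated=$terminated"
}
```

An assertion could then carry the state as its message, for example `assert(listener.terminated.nonEmpty, listener.diag("after stop"))`, so a timeout reports which events did arrive.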
   
   **Before (failing in apache/spark CI):** `listener events` 90s timeout, `terminate` empty:
   https://github.com/apache/spark/actions/runs/28004202389/job/82884598238
   
   **After (this change, validated on a fork):** full connect module green and
   `ClientStreamingQuerySuite."listener events"` re-run 8x with 0 failures:
   https://github.com/HyukjinKwon/spark/actions/runs/28074772169
   
   ### Does this PR introduce any user-facing change?
   
   No. Server hardening + test diagnostics only.
   
   ### How was this patch tested?
   
   Re-ran the full connect module and `ClientStreamingQuerySuite."listener events"` 8x on CI (link
   above); all green. Existing `SparkConnectListenerBusListenerSuite` onNext-throw tests still pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, drafted with Claude Code.
   

