rangadi opened a new pull request, #40937:
URL: https://github.com/apache/spark/pull/40937
### What changes were proposed in this pull request?
This fixes a couple of important issues related to session management for
streaming queries.
1. The session mapping should be maintained at the connect server as long as
the streaming query is active, even if there are no accesses from the client
side. Currently the session mapping is dropped after 1 hour of inactivity.
2. When a streaming query is stopped, the Spark session drops its reference to
the streaming query object. That implies it cannot be accessed by a remote
spark-connect client. It is a common usage pattern for users to access a
streaming query after it is stopped (e.g. to check its metrics, any
exception if it failed, etc).
- This is not a problem in legacy mode since the user code in the REPL
keeps the reference. This is no longer the case in Spark-Connect.
*Solution*: This PR adds `SparkConnectStreamingQueryCache` that does the
following:
* Each new streaming query is registered with this cache.
* It runs a periodic task that checks the status of these queries and
polls the session mapping in the connect server so that the session stays alive.
* When a query is stopped, it is cached for 1 more hour so that it can be
accessed from the remote client.
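
The three behaviours above can be illustrated with a minimal sketch. This is not the code in this PR (the actual `SparkConnectStreamingQueryCache` lives in the connect server); the class name `StreamingQueryCacheSketch`, the `keep_session_alive` callback, the `is_active` probe, and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CachedQuery:
    session_id: str
    is_active: Callable[[], bool]       # hypothetical probe for the query's status
    expires_at: Optional[float] = None  # set once the query is first seen stopped


class StreamingQueryCacheSketch:
    """Illustrative only: keeps sessions alive for active queries and
    retains stopped queries for a fixed period before evicting them."""

    def __init__(
        self,
        keep_session_alive: Callable[[str], None],  # assumed hook that touches the session mapping
        stopped_ttl_sec: float = 3600.0,            # 1 hour, as in the description
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._keep_session_alive = keep_session_alive
        self._stopped_ttl_sec = stopped_ttl_sec
        self._clock = clock
        self._entries: Dict[str, CachedQuery] = {}

    def register(self, query_id: str, session_id: str,
                 is_active: Callable[[], bool]) -> None:
        # Each new streaming query is registered with the cache.
        self._entries[query_id] = CachedQuery(session_id, is_active)

    def get(self, query_id: str) -> Optional[CachedQuery]:
        # Returns the cached entry, or None if it was never registered or was evicted.
        return self._entries.get(query_id)

    def periodic_check(self) -> None:
        # Intended to be called on a schedule (e.g. from a timer thread).
        now = self._clock()
        for query_id, entry in list(self._entries.items()):
            if entry.expires_at is None:
                if entry.is_active():
                    # Active query: touch the session so it does not expire.
                    self._keep_session_alive(entry.session_id)
                else:
                    # Query just stopped: keep it for the retention period.
                    entry.expires_at = now + self._stopped_ttl_sec
            elif now >= entry.expires_at:
                # Retention period over: drop the reference.
                del self._entries[query_id]
```

In this sketch the retention period starts when the periodic task first observes that the query has stopped, not at the exact moment it stopped; whether the real implementation does the same is an assumption.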
### Why are the changes needed?
- Explained in the above description.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Unit tests
- Manual testing