vinooganesh commented on issue #24807: [SPARK-27958][SQL] Stopping a SparkSession should not always stop Spark Context
URL: https://github.com/apache/spark/pull/24807#issuecomment-604778033

Hi all - apologies for the delay here (I switched companies since I initially worked on this, and realized I no longer had access to my previous work). I did some brushing up on this PR (it's been a while since I've touched it) and re-discovered a few things.

1. @HyukjinKwon - on your point about fixing the leak without changing the contract between a `SparkContext` and a `SparkSession`: it's actually not a trivial fix. The leak comes from the fact that the `SparkSession` attaches a listener to the singleton `SparkContext`, which the context then holds onto forever. Logically, at the end of the lifecycle of the `SparkSession` instance, I should be able to drop the listener from the context. We can do this on a per-instance basis for a session, but that's where the weirdness comes in (see point 2).
2. The weirdness comes from the fact that lifecycle operations are permitted both on the singleton `SparkSession` object and on the `session` instance created by `SparkSession.getOrCreate(...)`. Specifically, I can call `stop()` and kill the `SparkContext` from *any instance* of a session. That seems wrong, especially given an expected operating model where the active session and the default session can be different.
3. The concrete problem is that there is no way to clean up an instance of a session without killing the context as a whole.

Here's my proposal:

1. I'll need to introduce a way to "end" an instance of a `SparkSession` (i.e. mark it ready to be GCed) on a per-instance basis (the class, not the singleton) to fix the memory leak. I propose adding a new lifecycle method (maybe `end()`?) that marks an instance of a session as ready to be removed; in this method, I'll also clean up the leaked listener.
2. The singleton (`SparkSession` object) methods `clearActiveSession()` and `clearDefaultSession()` behave in a somewhat strange way: the latter drops the singleton's reference to the default session, but doesn't actually clean up the listener state associated with the context. We've been able to get by so far only because there isn't truly a way to stop a session - stopping the session just stops the context. If these sessions are meant to be lightweight, we need a way to spin them up and tear them down easily without affecting the underlying context. Meaning, in a regular operating mode, I could mark my session instance for GC (i.e. `spark.end()`) and have the garbage collector clean it up (unless the `SparkSession` object still holds a reference to it, which is expected behavior).
3. The point is valid that folks rely on the - albeit strange - behavior of stopping a session also stopping the global context. We can largely leave the current `spark.stop()` method unaffected (though I think we should rename / proxy it to a new `spark.stopContext()` method).

The plan above fixes the leak and changes the operating model without affecting the underlying functionality people have come to rely on. It still leaves it to the user to stop the context manually after their last `SparkSession` is done, but I think it's an improvement over what we have now. Thoughts?

cc @cloud-fan @jiangxb1987 @srowen @rdblue
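To make the lifecycle issue concrete, here is a minimal, hypothetical model (plain Python, not Spark's actual classes or APIs) of the leak and the proposed fix: each session registers a listener on the shared context at creation, a per-instance `end()` detaches only that session's listener, while `stop()` keeps today's behavior of killing the shared context from any instance.

```python
class Context:
    """Stands in for the singleton SparkContext."""
    def __init__(self):
        self.listeners = []
        self.stopped = False

    def add_listener(self, listener):
        self.listeners.append(listener)

    def remove_listener(self, listener):
        self.listeners.remove(listener)

    def stop(self):
        self.stopped = True


class Session:
    """Stands in for a SparkSession instance."""
    def __init__(self, context):
        self.context = context
        self.listener = object()  # per-session state listener
        context.add_listener(self.listener)

    def end(self):
        # Proposed lifecycle method: detach this instance's listener so
        # the session can be GCed without touching the shared context.
        self.context.remove_listener(self.listener)

    def stop(self):
        # Current behavior: stopping ANY session stops the shared context,
        # and the leaked listener is never removed.
        self.context.stop()


ctx = Context()
s1, s2 = Session(ctx), Session(ctx)
assert len(ctx.listeners) == 2

s1.end()   # cleans up only s1's listener; context keeps running
assert len(ctx.listeners) == 1 and not ctx.stopped

s2.stop()  # still kills the context for everyone, listener left behind
assert ctx.stopped and len(ctx.listeners) == 1
```

Without `end()`, the only way to drop a session's listener in this model is `stop()`, which takes the whole context down - which is exactly the coupling the proposal tries to break.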