Picking up this email thread again around point #1 below, filed https://issues.apache.org/jira/browse/SPARK-27958 and put up a PR (still have to write tests) https://github.com/apache/spark/pull/24807 just to begin the conversation.
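For anyone skimming the thread below: the behavior proposed in SPARK-27958 amounts to "stopping a session only stops the shared context when it is the last live session." A minimal sketch of that idea, using invented `ToyContext`/`ToySession` names (this is not Spark's actual API, just an illustration of the reference-counting approach):

```python
import threading

# Hypothetical sketch: sessions share one context, and stopping a session
# only stops the context once no live sessions remain.
# ToyContext/ToySession are invented names, not Spark's API.

class ToyContext:
    def __init__(self):
        self._lock = threading.Lock()
        self._live_sessions = 0
        self.stopped = False

    def register_session(self):
        with self._lock:
            self._live_sessions += 1

    def release_session(self):
        # The last session out stops the shared context.
        with self._lock:
            self._live_sessions -= 1
            if self._live_sessions == 0:
                self.stopped = True

class ToySession:
    def __init__(self, ctx):
        self.ctx = ctx
        self.closed = False
        ctx.register_session()

    def stop(self):
        self.closed = True
        self.ctx.release_session()

ctx = ToyContext()
s1, s2 = ToySession(ctx), ToySession(ctx)
s1.stop()
assert not ctx.stopped  # s2 is still live, so the context survives
s2.stop()
assert ctx.stopped      # last session out stops the context
```

As noted below, the difficulty is that this requires the context to track every session ever attached to it, which Spark does not currently do.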
From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Wednesday, April 3, 2019 at 13:14
To: Vinoo Ganesh <vgan...@palantir.com>
Cc: Sean Owen <sro...@gmail.com>, Arun Mahadevan <ar...@apache.org>, "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Closing a SparkSession stops the SparkContext

For #1, do we agree on the behavior? I think that closing a SparkSession should not close the SparkContext unless it is the only session. Evidently, that's not what happens, and I consider the current behavior a bug.

For more context, we're working on the new catalog APIs and how to guarantee consistent operations. Self-joining a table, for example, should use the same version of the table for both scans, and that state should be specific to a session, not global. These plans assume that SparkSession represents a session of interactions, along with a reasonable life-cycle. If that life-cycle includes closing all sessions when you close any session, then we can't really use sessions for this.

rb

On Wed, Apr 3, 2019 at 9:35 AM Vinoo Ganesh <vgan...@palantir.com> wrote:

Yeah, so I think there are 2 separate issues here:

1. The coupling of the SparkSession + SparkContext in their current form seems unnatural.
2. The current memory leak, which I do believe is a case where state is added onto the SparkContext but is only needed by the session (though I would appreciate a sanity check here).

Meaning, it may make sense to investigate an API change. Thoughts?

On 4/2/19, 15:13, "Sean Owen" <sro...@gmail.com> wrote:

> @Sean – To the point that Ryan made, it feels wrong that stopping a session force stops the global context. Building in the logic to only stop the context when the last session is stopped also feels like a solution, but the best way I can think about doing this involves storing the global list of every available SparkSession, which may be difficult.
I tend to agree it would be more natural for the SparkSession to have its own lifecycle 'stop' method that only stops/releases its own resources. But is that the source of the problem here? If the state you're trying to free is needed by the SparkContext, it won't help. If it happens to live in the SparkContext but is state only needed by one SparkSession, and there isn't any way to clean it up now, that's a compelling reason to change the API. Is that the situation? The only downside is making the user separately stop the SparkContext then.

From: Vinoo Ganesh <vgan...@palantir.com>
Date: Tuesday, April 2, 2019 at 13:24
To: Arun Mahadevan <ar...@apache.org>, Ryan Blue <rb...@netflix.com>
Cc: Sean Owen <sro...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Closing a SparkSession stops the SparkContext

// Merging threads

Thanks everyone for your thoughts. I'm very much in sync with Ryan here.

@Sean – To the point that Ryan made, it feels wrong that stopping a session force stops the global context. Building in the logic to only stop the context when the last session is stopped also feels like a solution, but the best way I can think about doing this involves storing the global list of every available SparkSession, which may be difficult.

@Arun – If the intention is not to be able to clear and create new sessions, then what specifically is the intended use case of Sessions?
https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html describes SparkSessions as time-bounded interactions, which implies that old ones should be clear-able and new ones create-able in lockstep without adverse effect?

From: Arun Mahadevan <ar...@apache.org>
Date: Tuesday, April 2, 2019 at 12:31
To: Ryan Blue <rb...@netflix.com>
Cc: Vinoo Ganesh <vgan...@palantir.com>, Sean Owen <sro...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Closing a SparkSession stops the SparkContext

I am not sure how it would cause a leak though. When a Spark session or the underlying context is stopped, it should clean up everything. getOrCreate is supposed to return the active thread-local or the global session. Maybe if you keep creating new sessions after explicitly clearing the default and the local sessions, and keep leaking the sessions, it could happen, but I don't think Sessions are intended to be used that way.

On Tue, 2 Apr 2019 at 08:45, Ryan Blue <rb...@netflix.com.invalid> wrote:

I think Vinoo is right about the intended behavior. If we support multiple sessions in one context, then stopping any one session shouldn't stop the shared context. The last session to be stopped should stop the context, but not any before that. We don't typically run multiple sessions in the same context so we haven't hit this, but it sounds reasonable.
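The resolution order Arun describes for getOrCreate (thread-local active session first, then the global default, else create a new one) can be sketched as follows. This is a toy model for illustration only; the `Session` class and registry functions are invented names, not Spark internals:

```python
import threading

# Hypothetical sketch of the getOrCreate resolution order described above:
# prefer the thread-local "active" session, fall back to the global
# "default" session, otherwise create (and register) a new one.
# Names are invented for illustration, not Spark's internals.

class Session:
    def __init__(self, name):
        self.name = name

_local = threading.local()   # per-thread active session
_default = None              # process-wide default session

def set_active_session(s):
    _local.active = s

def clear_active_session():
    _local.active = None

def set_default_session(s):
    global _default
    _default = s

def clear_default_session():
    global _default
    _default = None

def get_or_create():
    active = getattr(_local, "active", None)
    if active is not None:
        return active
    if _default is not None:
        return _default
    s = Session("new")
    set_default_session(s)  # a newly created session becomes the default
    return s
```

Under this model, clearing the active and default sessions and then calling `get_or_create()` again yields a fresh session while the old ones are still reachable by whoever holds them, which is the "keep leaking the sessions" scenario Arun mentions.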
On 4/2/19, 11:44, "Sean Owen" <sro...@gmail.com> wrote:

Yeah, there's one global default session, but it's possible to create others and set them as the thread's active session, to allow for different configurations in the SparkSession within one app. I think you're asking why closing one of them would effectively shut all of them down by stopping the SparkContext. My best guess is simply, well, that's how it works. You'd only call this, like SparkContext.stop(), when you know the whole app is done and want to clean up. SparkSession is a kind of wrapper on SparkContext and it wouldn't be great to make users stop all the sessions and then go find and stop the context.

If there is some per-SparkSession state that needs a cleanup, then that's a good point, as I don't see a lifecycle method that means "just close this session". You're talking about SparkContext state though, right? There is definitely just one SparkContext, and it can/should only be stopped when the app is really done. Is the point that each session is adding some state to the context and doesn't have any mechanism for removing it?

On Tue, Apr 2, 2019 at 8:23 AM Vinoo Ganesh <vgan...@palantir.com> wrote:

Hey Sean - Cool, maybe I'm misunderstanding the intent of clearing a session vs. stopping it. The cause of the leak looks to be this line: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131.
The ExecutionListenerBus that's added persists forever on the context's listener bus (the SparkContext ListenerBus holds the ExecutionListenerBus). I'm trying to figure out where this cleanup should happen. With the current implementation, calling SparkSession.stop will clean up the ExecutionListenerBus (since the context itself is stopped), but it's unclear to me why terminating one session should terminate the JVM-global context. It's possible my mental model is off here, but I would expect stopping a session to remove all traces of that session while keeping the context alive, and stopping a context to, well, stop the context. If stopping the session is expected to stop the context, what's the intended usage of clearing the active / default session?

Vinoo

On 4/2/19, 10:57, "Sean Owen" <sro...@gmail.com> wrote:

What are you expecting there ... that sounds correct? Something else needs to be closed?

On Tue, Apr 2, 2019 at 9:45 AM Vinoo Ganesh <vgan...@palantir.com> wrote:
>
> Hi All -
>
> I've been digging into the code and looking into what appears to be a memory leak (https://jira.apache.org/jira/browse/SPARK-27337) and have noticed something kind of peculiar about the way closing a SparkSession is handled. Despite being marked as Closeable, closing/stopping a SparkSession simply stops the SparkContext.
> This change happened as a result of one of the PRs addressing https://jira.apache.org/jira/browse/SPARK-15073, in https://github.com/apache/spark/pull/12873/files#diff-d91c284798f1c98bf03a31855e26d71cR596.
>
> I'm trying to understand why this is the intended behavior – anyone have any knowledge of why this is the case?
>
> Thanks,
> Vinoo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
Ryan Blue
Software Engineer
Netflix
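[Postscript for readers of the archive] The leak pattern discussed in this thread reduces to a session registering a listener on the long-lived context bus with nothing ever removing it. A minimal sketch of that pattern, and of the proposed fix, using invented names (`ListenerBus`, `LeakySession`, `FixedSession` are for illustration only, not Spark's classes):

```python
# Hypothetical sketch of the leak described in this thread: each session
# registers a listener on the shared, long-lived context bus, and nothing
# removes it when the session stops. Names are invented, not Spark's.

class ListenerBus:
    def __init__(self):
        self.listeners = []

    def add(self, listener):
        self.listeners.append(listener)

    def remove(self, listener):
        self.listeners.remove(listener)

class LeakySession:
    """Registers on the bus; stop() forgets to deregister (current behavior
    unless the whole context is stopped)."""
    def __init__(self, bus):
        self.bus = bus
        self.listener = object()
        bus.add(self.listener)

    def stop(self):
        pass  # the listener stays on the context's bus forever

class FixedSession:
    """Same, but stop() removes its own listener (the proposed cleanup)."""
    def __init__(self, bus):
        self.bus = bus
        self.listener = object()
        bus.add(self.listener)

    def stop(self):
        self.bus.remove(self.listener)

bus = ListenerBus()
for _ in range(100):
    LeakySession(bus).stop()
print(len(bus.listeners))   # 100: one stale listener per stopped session

bus2 = ListenerBus()
for _ in range(100):
    FixedSession(bus2).stop()
print(len(bus2.listeners))  # 0: each session cleaned up after itself
```

The key property: with the leaky version the bus grows by one entry per session ever created, even after every session has been stopped, while the fixed version leaves the shared bus empty.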