Yeah, so I think there are two separate issues here:


  1.  The coupling of the SparkSession and SparkContext in their current form
seems unnatural.
  2.  The current memory leak, which I do believe is a case where state is
added onto the SparkContext but is only needed by the session (though I would
appreciate a sanity check here). Meaning, it may make sense to investigate an
API change.



Thoughts?



On 4/2/19, 15:13, "Sean Owen" <sro...@gmail.com> wrote:

    > @Sean – To the point that Ryan made, it feels wrong that stopping a
session force stops the global context. Building in the logic to only stop the
context when the last session is stopped also feels like a solution, but the
best way I can think about doing this involves storing the global list of every
available SparkSession, which may be difficult.

    I tend to agree it would be more natural for the SparkSession to have
    its own lifecycle 'stop' method that only stops/releases its own
    resources. But is that the source of the problem here? If the state
    you're trying to free is needed by the SparkContext, it won't help. If
    it happens to be in the SparkContext but is state only needed by one
    SparkSession, and there isn't any way to clean it up now, that's a
    compelling reason to change the API. Is that the situation? The only
    downside is making the user separately stop the SparkContext then.

From: Vinoo Ganesh <vgan...@palantir.com>
Date: Tuesday, April 2, 2019 at 13:24
To: Arun Mahadevan <ar...@apache.org>, Ryan Blue <rb...@netflix.com>
Cc: Sean Owen <sro...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Closing a SparkSession stops the SparkContext

// Merging threads

Thanks everyone for your thoughts. I’m very much in sync with Ryan here.

@Sean – To the point that Ryan made, it feels wrong that stopping a session 
force stops the global context. Building in the logic to only stop the context 
when the last session is stopped also feels like a solution, but the best way I 
can think about doing this involves storing the global list of every available 
SparkSession, which may be difficult.
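To make that concrete, the "last session stopped also stops the context" semantics could be modeled with a registry of live sessions held by the context. This is a hypothetical Python sketch, not the actual Spark API; `TrackedContext` and `TrackedSession` are invented names for illustration:

```python
class TrackedContext:
    """Stand-in for a shared SparkContext that tracks its live sessions."""
    def __init__(self):
        self.stopped = False
        self.sessions = set()          # the "global list of every session"

class TrackedSession:
    """Stand-in for a SparkSession that registers itself on creation."""
    def __init__(self, ctx):
        self.ctx = ctx
        ctx.sessions.add(self)

    def stop(self):
        self.ctx.sessions.discard(self)
        if not self.ctx.sessions:      # last session out stops the context
            self.ctx.stopped = True

ctx = TrackedContext()
s1, s2 = TrackedSession(ctx), TrackedSession(ctx)
s1.stop()
print(ctx.stopped)   # False: another session is still alive
s2.stop()
print(ctx.stopped)   # True: the last session stopped the context
```

The difficulty noted above is exactly the `ctx.sessions` set: something has to own that registry and keep it consistent across threads.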

@Arun – If the intention is not to be able to clear and create new sessions,
then what specifically is the intended use case of Sessions?
https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
describes SparkSessions as time-bounded interactions, which implies that old
ones should be clearable and new ones creatable in lockstep without adverse
effect?

From: Arun Mahadevan <ar...@apache.org>
Date: Tuesday, April 2, 2019 at 12:31
To: Ryan Blue <rb...@netflix.com>
Cc: Vinoo Ganesh <vgan...@palantir.com>, Sean Owen <sro...@gmail.com>, 
"dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Closing a SparkSession stops the SparkContext

I am not sure how it would cause a leak, though. When a Spark session or the
underlying context is stopped, it should clean up everything. getOrCreate is
supposed to return the active thread-local or the global session. Maybe if you
keep creating new sessions after explicitly clearing the default and the local
sessions, and keep leaking the sessions, it could happen, but I don't think
sessions are intended to be used that way.

On Tue, 2 Apr 2019 at 08:45, Ryan Blue <rb...@netflix.com.invalid> wrote:
I think Vinoo is right about the intended behavior. If we support multiple 
sessions in one context, then stopping any one session shouldn't stop the 
shared context. The last session to be stopped should stop the context, but not 
any before that. We don't typically run multiple sessions in the same context 
so we haven't hit this, but it sounds reasonable.


On 4/2/19, 11:44, "Sean Owen" <sro...@gmail.com> wrote:



    Yeah, there's one global default session, but it's possible to create
    others and set them as the thread's active session, to allow for
    different configurations of the SparkSession within one app. I think
    you're asking why closing one of them would effectively shut all of
    them down by stopping the SparkContext. My best guess is simply, well,
    that's how it works. You'd only call this, like SparkContext.stop(),
    when you know the whole app is done and want to clean up. SparkSession
    is a kind of wrapper on SparkContext and it wouldn't be great to make
    users stop all the sessions and then go find and stop the context.
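The resolution order being described (the thread's active session first, then the global default, else create one) can be sketched roughly like this. This is illustrative Python, not Spark code; the real logic lives in SparkSession.getOrCreate:

```python
import threading

_active = threading.local()   # per-thread "active" session slot
_default = None               # process-wide default session

class Session:
    """Placeholder for a configured SparkSession."""
    pass

def get_or_create():
    global _default
    active = getattr(_active, "session", None)
    if active is not None:        # 1. the thread's active session wins
        return active
    if _default is None:          # 2. otherwise fall back to the default,
        _default = Session()      #    creating it on first use
    return _default

def set_active_session(session):
    """Model of setting a thread-local active session."""
    _active.session = session
```

Clearing the active/default slots and then calling get_or_create again hands back a brand-new session, which is roughly the "clear then recreate" pattern this thread is debating.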



    If there is some per-SparkSession state that needs a cleanup, then
    that's a good point, as I don't see a lifecycle method that means
    "just close this session". You're talking about SparkContext state
    though, right? And there is definitely just one SparkContext; it
    can/should only be stopped when the app is really done.



    Is the point that each session is adding some state to the context and
    doesn't have any mechanism for removing it?


On Tue, Apr 2, 2019 at 8:23 AM Vinoo Ganesh <vgan...@palantir.com> wrote:
Hey Sean - Cool, maybe I'm misunderstanding the intent of clearing a session 
vs. stopping it.

The cause of the leak looks to be this line here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131.
The ExecutionListenerBus that's added persists forever on the context's
listener bus (the SparkContext ListenerBus holds an ExecutionListenerBus). I'm
trying to figure out the place where this cleanup should happen.
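To show the shape of the leak, here is a hypothetical Python model (not Spark's actual classes; all names invented): each session registers a listener on the shared context-level bus at construction, and nothing ever removes it, so the bus grows with every session created:

```python
class ListenerBus:
    """Stand-in for the SparkContext-level listener bus."""
    def __init__(self):
        self.listeners = []

class Context:
    def __init__(self):
        self.bus = ListenerBus()

class Session:
    def __init__(self, ctx):
        self.ctx = ctx
        self.listener = object()                  # stands in for ExecutionListenerBus
        ctx.bus.listeners.append(self.listener)   # registered at session creation

    def stop(self):
        pass   # today: no per-session cleanup, so the listener is left behind

ctx = Context()
for _ in range(3):
    Session(ctx).stop()
print(len(ctx.bus.listeners))   # 3: stopped sessions are still pinned by the bus
```

In this model the fix would be for stop() to run `self.ctx.bus.listeners.remove(self.listener)`, i.e. a session-scoped cleanup hook on the shared bus.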

With the current implementation, calling SparkSession.stop will clean up the 
ExecutionListenerBus (since the context itself is stopped), but it's unclear to 
me why terminating one session should terminate the JVM-global context. 
Possible my mental model is off here, but I would expect stopping a session to 
remove all traces of that session, while keeping the context alive, and 
stopping a context would, well, stop the context.
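That mental model could look roughly like this sketch (hypothetical Python with invented names, not the real API), contrasting today's behavior with a stop that only removes the session's own traces:

```python
class Context:
    def __init__(self):
        self.stopped = False
        self.session_state = []   # state sessions have parked on the context

    def stop(self):
        self.stopped = True

class Session:
    def __init__(self, ctx):
        self.ctx = ctx
        self.state = object()
        ctx.session_state.append(self.state)

    def stop_today(self):
        self.ctx.stop()           # current behavior: stops the shared context

    def stop_proposed(self):
        # remove only this session's traces; the context stays alive
        self.ctx.session_state.remove(self.state)
```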

If stopping the session is expected to stop the context, what's the intended 
usage of clearing the active / default session?

Vinoo

On 4/2/19, 10:57, "Sean Owen" <sro...@gmail.com> wrote:

    What are you expecting there ... that sounds correct? something else
    needs to be closed?

    On Tue, Apr 2, 2019 at 9:45 AM Vinoo Ganesh <vgan...@palantir.com> wrote:
    >
    > Hi All -
    >
    >    I’ve been digging into the code and looking into what appears to be a
    > memory leak (https://jira.apache.org/jira/browse/SPARK-27337) and have
    > noticed something kind of peculiar about the way closing a SparkSession
    > is handled. Despite being marked as Closeable, closing/stopping a
    > SparkSession simply stops the SparkContext. This change happened as a
    > result of one of the PRs addressing
    > https://jira.apache.org/jira/browse/SPARK-15073 in
    > https://github.com/apache/spark/pull/12873/files#diff-d91c284798f1c98bf03a31855e26d71cR596.
    >
    >
    >
    > I’m trying to understand why this is the intended behavior – anyone have 
any knowledge of why this is the case?
    >
    >
    >
    > Thanks,
    >
    > Vinoo



---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


--
Ryan Blue
Software Engineer
Netflix
