Bump! Analytics team, I'm eager to have input from y'all about the best Spark settings to use.
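For context on what I mean by settings, here's a rough sketch of the kind of thing I'd like wmfdata to pass when it builds a YARN session. The numbers below are placeholders, not recommendations; the right values are exactly the input I'm asking for (see T245097):

    from pyspark.sql import SparkSession

    # Placeholder values only - what they should actually be for local SWAP,
    # regular YARN, and "large" YARN jobs is the open question in T245097.
    # (In client mode, spark.driver.memory may need to be set before the
    # driver JVM starts, e.g. via spark-defaults, rather than here.)
    spark = (
        SparkSession.builder
        .master("yarn")
        .appName("wmfdata")
        .config("spark.driver.memory", "2g")            # is 8g reasonable on a 64 GiB SWAP host?
        .config("spark.executor.memory", "4g")          # rather than the ~400 MiB executors we get now
        .config("spark.executor.cores", "2")
        .config("spark.dynamicAllocation.maxExecutors", "64")
        .config("spark.sql.shuffle.partitions", "200")  # worth raising for big YARN jobs?
        .enableHiveSupport()
        .getOrCreate()
    )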
On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn <[email protected]> wrote:

> I ran into this problem again, and I found that neither session.stop nor
> newSession got rid of the error. So it's still not clear how to recover
> from a crashed(?) Spark session.
>
> On the other hand, I did figure out why my sessions were crashing in the
> first place, so hopefully recovering from that will be a rare need. The
> reason is that wmfdata doesn't modify the default Spark settings
> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60>
> when it starts a new session, which was (for example) causing it to start
> executors with only ~400 MiB of memory each.
>
> I'm definitely going to change that, but it's not completely clear what
> the recommended settings for our cluster are. I cataloged the different
> recommendations at https://phabricator.wikimedia.org/T245097, and it would
> be very helpful if one of y'all could give some clear recommendations
> about what the settings should be for local SWAP, YARN, and "large" YARN
> jobs. For example, is it important to increase spark.sql.shuffle.partitions
> for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local
> job when the SWAP servers only have 64 GiB total?
>
> Thank you!
>
> On Fri, 7 Feb 2020 at 06:53, Andrew Otto <[email protected]> wrote:
>
>> Hm, interesting! I don't think many of us have used
>> SparkSession.builder.getOrCreate repeatedly in the same process. What
>> happens if you manually stop the Spark session first (session.stop()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>),
>> or maybe try to explicitly create a new session via newSession()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>?
>>
>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn <[email protected]> wrote:
>>
>>> Hi Luca!
>>>
>>> Those were separate YARN jobs I started later. When I got this error, I
>>> found that the YARN job corresponding to the SparkContext was marked as
>>> "successful", but I still couldn't get SparkSession.builder.getOrCreate
>>> to open a new one.
>>>
>>> Any idea what might have caused that or how I could recover without
>>> restarting the notebook, which could mean losing a lot of in-progress
>>> work? I had already restarted that kernel, so I don't know if I'll
>>> encounter this problem again. If I do, I'll file a task.
>>>
>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano <[email protected]> wrote:
>>>
>>>> Hey Neil,
>>>>
>>>> there were two YARN jobs running related to your notebooks; I just
>>>> killed them. Let's see if that solves the problem (you might need to
>>>> restart your notebook again). If not, let's open a task and
>>>> investigate :)
>>>>
>>>> Luca
>>>>
>>>> On Thu, 6 Feb 2020 at 02:08, Neil Shah-Quinn <[email protected]> wrote:
>>>>
>>>>> Whoa - I just got the same stopped SparkContext error on the query
>>>>> even after restarting the notebook, without an intermediate Java heap
>>>>> space error. That seems very strange to me.
>>>>>
>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn <[email protected]> wrote:
>>>>>
>>>>>> Hey there!
>>>>>>
>>>>>> I was running SQL queries via PySpark (using the wmfdata package
>>>>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>)
>>>>>> on SWAP when one of my queries failed with
>>>>>> "java.lang.OutOfMemoryError: Java heap space".
>>>>>>
>>>>>> After that, when I tried to call the spark.sql function again (via
>>>>>> wmfdata.hive.run), it failed with "java.lang.IllegalStateException:
>>>>>> Cannot call methods on a stopped SparkContext."
>>>>>>
>>>>>> When I tried to create a new Spark context using
>>>>>> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
>>>>>> or directly), it returned a SparkSession object properly, but calling
>>>>>> the object's sql function still gave the "stopped SparkContext" error.
>>>>>>
>>>>>> Any idea what's going on? I assume restarting the notebook kernel
>>>>>> would take care of the problem, but it seems like there has to be a
>>>>>> better way to recover.
>>>>>>
>>>>>> Thank you!
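And on the recovery question from the thread below: my understanding is that newSession() hands back a session that shares the same underlying SparkContext, which would explain why it didn't help once that context was stopped. For the record, this is roughly the sequence I tried, per Andrew's suggestion (a sketch from memory, not an exact repro):

    from pyspark.sql import SparkSession

    # getOrCreate() handed back the existing session, whose SparkContext
    # had already been stopped after the OutOfMemoryError.
    old = SparkSession.builder.getOrCreate()
    old.stop()  # stop it explicitly

    # Hoped this would build a fresh session/context, but spark.sql() still
    # raised "Cannot call methods on a stopped SparkContext".
    spark = SparkSession.builder.getOrCreate()
    spark.sql("SELECT 1").show()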
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
