Hello: Following up on this issue, we think many of Neil's issues come from the fact that a Kerberos ticket expires after 24 hours, and once it does, your Spark session no longer works. We will be extending the ticket expiration somewhat, to 2-3 days, but the main point to take home is that Jupyter notebooks do not live forever in the state you leave them in; a kernel restart might be needed.

Please take a look at ticket: https://phabricator.wikimedia.org/T246132 If anybody has been having similar problems, please chime in.

Thanks,

Nuria
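For reference, a minimal sketch of how one might check from Python whether the Kerberos ticket backing a notebook is still valid before creating a Spark session. It just shells out to the standard klist command; the helper function is illustrative and not part of wmfdata:

    import subprocess

    def kerberos_ticket_is_valid():
        """Return True if the credential cache holds a non-expired ticket."""
        # klist -s produces no output and exits 0 when the cache can be
        # read and is not expired, and exits non-zero otherwise.
        return subprocess.run(["klist", "-s"]).returncode == 0

    if not kerberos_ticket_is_valid():
        print("Kerberos ticket missing or expired: run kinit in a terminal "
              "and restart the notebook kernel before using Spark.")

If the ticket has expired mid-session, renewing it alone may not be enough; as noted above, the already-running Spark session can still be broken and a kernel restart may be needed.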
On Thu, Feb 20, 2020 at 2:27 AM Luca Toscano <[email protected]> wrote:

> Hi Neil,
>
> I added the Analytics tag to https://phabricator.wikimedia.org/T245097, and also thanks for filing https://phabricator.wikimedia.org/T245713. We periodically review tasks in our incoming queue, so we should be able to help soon, but it may depend on priorities.
>
> Luca
>
> On Thu, 20 Feb 2020 at 06:21, Neil Shah-Quinn <[email protected]> wrote:
>
>> Another update: I'm continuing to encounter these Spark errors and to have trouble recovering from them, even when I use proper settings. I've filed T245713 <https://phabricator.wikimedia.org/T245713> to discuss this further. The specific errors and behavior I'm seeing (for example, whether explicitly calling session.stop allows a new functioning session to be created) are not consistent, so I'm still trying to make sense of it.
>>
>> I would greatly appreciate any input or help, even if it's identifying places where my description doesn't make sense.
>>
>> On Wed, 19 Feb 2020 at 13:35, Neil Shah-Quinn <[email protected]> wrote:
>>
>>> Bump!
>>>
>>> Analytics team, I'm eager to have input from y'all about the best Spark settings to use.
>>>
>>> On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn <[email protected]> wrote:
>>>
>>>> I ran into this problem again, and I found that neither session.stop nor newSession got rid of the error. So it's still not clear how to recover from a crashed(?) Spark session.
>>>>
>>>> On the other hand, I did figure out why my sessions were crashing in the first place, so hopefully recovering from that will be a rare need. The reason is that wmfdata doesn't modify the default Spark settings <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60> when it starts a new session, which was (for example) causing it to start executors with only ~400 MiB of memory each.
>>>>
>>>> I'm definitely going to change that, but it's not completely clear what the recommended settings for our cluster are. I cataloged the different recommendations at https://phabricator.wikimedia.org/T245097, and it would be very helpful if one of y'all could give some clear recommendations about what the settings should be for local SWAP, YARN, and "large" YARN jobs. For example, is it important to increase spark.sql.shuffle.partitions for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local job when the SWAP servers only have 64 GiB total?
>>>>
>>>> Thank you!
>>>>
>>>> On Fri, 7 Feb 2020 at 06:53, Andrew Otto <[email protected]> wrote:
>>>>
>>>>> Hm, interesting! I don't think many of us have used SparkSession.builder.getOrCreate repeatedly in the same process. What happens if you manually stop the Spark session first (session.stop() <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>), or maybe try to explicitly create a new session via newSession() <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>?
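For concreteness, a rough sketch of what that suggestion looks like in PySpark, assuming spark is the existing (possibly broken) session object. Whether this actually clears the stopped-SparkContext state is exactly what is in question in this thread, so it is something to try rather than a confirmed fix:

    from pyspark.sql import SparkSession

    # Explicitly stop the existing session so that getOrCreate() does not
    # just hand back the cached, already-stopped instance.
    spark.stop()

    # Build a fresh session; getOrCreate() should now create a new
    # SparkContext rather than reuse the stopped one.
    spark = SparkSession.builder.appName("swap-recovery-test").getOrCreate()

    # newSession() is different: it returns a session that shares the same
    # underlying SparkContext, so it only helps if that context is still alive.
    # new_spark = spark.newSession()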
>>>>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn <[email protected]> wrote:
>>>>>
>>>>>> Hi Luca!
>>>>>>
>>>>>> Those were separate Yarn jobs I started later. When I got this error, I found that the Yarn job corresponding to the SparkContext was marked as "successful", but I still couldn't get SparkSession.builder.getOrCreate to open a new one.
>>>>>>
>>>>>> Any idea what might have caused that or how I could recover without restarting the notebook, which could mean losing a lot of in-progress work? I had already restarted that kernel, so I don't know if I'll encounter this problem again. If I do, I'll file a task.
>>>>>>
>>>>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Neil,
>>>>>>>
>>>>>>> There were two Yarn jobs running that were related to your notebooks; I just killed them. Let's see if that solves the problem (you might need to restart your notebook again). If not, let's open a task and investigate :)
>>>>>>>
>>>>>>> Luca
>>>>>>>
>>>>>>> On Thu, 6 Feb 2020 at 02:08, Neil Shah-Quinn <[email protected]> wrote:
>>>>>>>
>>>>>>>> Whoa, I just got the same stopped SparkContext error on the query even after restarting the notebook, without an intermediate Java heap space error. That seems very strange to me.
>>>>>>>>
>>>>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey there!
>>>>>>>>>
>>>>>>>>> I was running SQL queries via PySpark (using the wmfdata package <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>) on SWAP when one of my queries failed with "java.lang.OutOfMemoryError: Java heap space".
>>>>>>>>>
>>>>>>>>> After that, when I tried to call the spark.sql function again (via wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext."
>>>>>>>>>
>>>>>>>>> When I tried to create a new Spark session using SparkSession.builder.getOrCreate (whether via wmfdata.spark.get_session or directly), it returned a SparkSession object properly, but calling the object's sql function still gave the "stopped SparkContext" error.
>>>>>>>>>
>>>>>>>>> Any idea what's going on? I assume restarting the notebook kernel would take care of the problem, but it seems like there has to be a better way to recover.
>>>>>>>>>
>>>>>>>>> Thank you!
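For illustration, a sketch of how settings like the ones discussed above could be passed when a session is created, along the lines of what wmfdata might do once it stops relying on the defaults. The values shown are placeholders for discussion, not the Analytics team's recommendations (those are what T245097 asks for):

    from pyspark.sql import SparkSession

    # Placeholder values: the right numbers for SWAP, regular YARN, and
    # "large" YARN jobs are what T245097 asks the Analytics team to pin down.
    spark = (
        SparkSession.builder
        .master("yarn")
        .appName("wmfdata-example")
        # spark.driver.memory normally has to be set before the driver JVM
        # starts, so from a notebook it may need to be passed via
        # PYSPARK_SUBMIT_ARGS rather than through the builder.
        .config("spark.driver.memory", "2g")
        .config("spark.executor.memory", "4g")
        .config("spark.executor.cores", "2")
        .config("spark.dynamicAllocation.maxExecutors", "64")
        .config("spark.sql.shuffle.partitions", "256")
        .getOrCreate()
    )

    spark.sql("SELECT 1").show()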
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
