Hello: Following up on this issue, we think many of Neil's issues come from the fact that a Kerberos ticket expires after 24 hours, and once it does, your Spark session no longer works. We will be extending the ticket expiration somewhat, to 2-3 days, but the main point to take home is that Jupyter notebooks do not live forever in the state you leave them in; a kernel restart might be needed.

Please take a look at ticket: https://phabricator.wikimedia.org/T246132 If anybody has been having similar problems, please chime in.

Thanks,

Nuria
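For reference, a minimal sketch of how one might check from Python whether the Kerberos ticket backing a notebook is still valid before creating a Spark session. It just shells out to the standard klist command; the helper function is illustrative and not part of wmfdata:

    import subprocess

    def kerberos_ticket_is_valid():
        """Return True if the credential cache holds a non-expired ticket."""
        # klist -s produces no output and exits 0 when the cache can be
        # read and is not expired, and exits non-zero otherwise.
        return subprocess.run(["klist", "-s"]).returncode == 0

    if not kerberos_ticket_is_valid():
        print("Kerberos ticket missing or expired: run kinit in a terminal "
              "and restart the notebook kernel before using Spark.")

If the ticket has expired mid-session, renewing it alone may not be enough; as noted above, the already-running Spark session can still be broken and a kernel restart may be needed.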
On Thu, Feb 20, 2020 at 2:27 AM Luca Toscano <[email protected]> wrote:

> Hi Neil,
>
> I added the Analytics tag to https://phabricator.wikimedia.org/T245097, and also thanks for filing https://phabricator.wikimedia.org/T245713. We periodically review tasks in our incoming queue, so we should be able to help soon, but it may depend on priorities.
>
> Luca
>
> On Thu, 20 Feb 2020 at 06:21, Neil Shah-Quinn <[email protected]> wrote:
>
>> Another update: I'm continuing to encounter these Spark errors and to have trouble recovering from them, even when I use proper settings. I've filed T245713 <https://phabricator.wikimedia.org/T245713> to discuss this further. The specific errors and behavior I'm seeing (for example, whether explicitly calling session.stop allows a new functioning session to be created) are not consistent, so I'm still trying to make sense of it.
>>
>> I would greatly appreciate any input or help, even if it's identifying places where my description doesn't make sense.
>>
>> On Wed, 19 Feb 2020 at 13:35, Neil Shah-Quinn <[email protected]> wrote:
>>
>>> Bump!
>>>
>>> Analytics team, I'm eager to have input from y'all about the best Spark settings to use.
>>>
>>> On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn <[email protected]> wrote:
>>>
>>>> I ran into this problem again, and I found that neither session.stop nor newSession got rid of the error. So it's still not clear how to recover from a crashed(?) Spark session.
>>>>
>>>> On the other hand, I did figure out why my sessions were crashing in the first place, so hopefully recovering from that will be a rare need. The reason is that wmfdata doesn't modify the default Spark settings <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60> when it starts a new session, which was (for example) causing it to start executors with only ~400 MiB of memory each.
>>>>
>>>> I'm definitely going to change that, but it's not completely clear what the recommended settings for our cluster are. I cataloged the different recommendations at https://phabricator.wikimedia.org/T245097, and it would be very helpful if one of y'all could give some clear recommendations about what the settings should be for local SWAP, YARN, and "large" YARN jobs. For example, is it important to increase spark.sql.shuffle.partitions for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local job when the SWAP servers only have 64 GiB total?
>>>>
>>>> Thank you!
>>>>
>>>> On Fri, 7 Feb 2020 at 06:53, Andrew Otto <[email protected]> wrote:
>>>>
>>>>> Hm, interesting! I don't think many of us have used SparkSession.builder.getOrCreate repeatedly in the same process. What happens if you manually stop the Spark session first (session.stop() <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>), or maybe try to explicitly create a new session via newSession() <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>?
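For concreteness, a rough sketch of what that suggestion looks like in PySpark, assuming spark is the existing (possibly broken) session object. Whether this actually clears the stopped-SparkContext state is exactly what is in question in this thread, so it is something to try rather than a confirmed fix:

    from pyspark.sql import SparkSession

    # Explicitly stop the existing session so that getOrCreate() does not
    # just hand back the cached, already-stopped instance.
    spark.stop()

    # Build a fresh session; getOrCreate() should now create a new
    # SparkContext rather than reuse the stopped one.
    spark = SparkSession.builder.appName("swap-recovery-test").getOrCreate()

    # newSession() is different: it returns a session that shares the same
    # underlying SparkContext, so it only helps if that context is still alive.
    # new_spark = spark.newSession()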
>>>>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn <[email protected]> wrote:
>>>>>
>>>>>> Hi Luca!
>>>>>>
>>>>>> Those were separate Yarn jobs I started later. When I got this error, I found that the Yarn job corresponding to the SparkContext was marked as "successful", but I still couldn't get SparkSession.builder.getOrCreate to open a new one.
>>>>>>
>>>>>> Any idea what might have caused that or how I could recover without restarting the notebook, which could mean losing a lot of in-progress work? I had already restarted that kernel, so I don't know if I'll encounter this problem again. If I do, I'll file a task.
>>>>>>
>>>>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Neil,
>>>>>>>
>>>>>>> There were two Yarn jobs running that were related to your notebooks; I just killed them. Let's see if that solves the problem (you might need to restart your notebook again). If not, let's open a task and investigate :)
>>>>>>>
>>>>>>> Luca
>>>>>>>
>>>>>>> On Thu, 6 Feb 2020 at 02:08, Neil Shah-Quinn <[email protected]> wrote:
>>>>>>>
>>>>>>>> Whoa, I just got the same stopped SparkContext error on the query even after restarting the notebook, without an intermediate Java heap space error. That seems very strange to me.
>>>>>>>>
>>>>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey there!
>>>>>>>>>
>>>>>>>>> I was running SQL queries via PySpark (using the wmfdata package <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>) on SWAP when one of my queries failed with "java.lang.OutOfMemoryError: Java heap space".
>>>>>>>>>
>>>>>>>>> After that, when I tried to call the spark.sql function again (via wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext."
>>>>>>>>>
>>>>>>>>> When I tried to create a new Spark session using SparkSession.builder.getOrCreate (whether via wmfdata.spark.get_session or directly), it returned a SparkSession object properly, but calling the object's sql function still gave the "stopped SparkContext" error.
>>>>>>>>>
>>>>>>>>> Any idea what's going on? I assume restarting the notebook kernel would take care of the problem, but it seems like there has to be a better way to recover.
>>>>>>>>>
>>>>>>>>> Thank you!
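For illustration, a sketch of how settings like the ones discussed above could be passed when a session is created, along the lines of what wmfdata might do once it stops relying on the defaults. The values shown are placeholders for discussion, not the Analytics team's recommendations (those are what T245097 asks for):

    from pyspark.sql import SparkSession

    # Placeholder values: the right numbers for SWAP, regular YARN, and
    # "large" YARN jobs are what T245097 asks the Analytics team to pin down.
    spark = (
        SparkSession.builder
        .master("yarn")
        .appName("wmfdata-example")
        # spark.driver.memory normally has to be set before the driver JVM
        # starts, so from a notebook it may need to be passed via
        # PYSPARK_SUBMIT_ARGS rather than through the builder.
        .config("spark.driver.memory", "2g")
        .config("spark.executor.memory", "4g")
        .config("spark.executor.cores", "2")
        .config("spark.dynamicAllocation.maxExecutors", "64")
        .config("spark.sql.shuffle.partitions", "256")
        .getOrCreate()
    )

    spark.sql("SELECT 1").show()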
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
