Bump! Analytics team, I'm eager to have input from y'all about the best Spark settings to use.
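For context on what I mean by settings, here's a rough sketch of the kind of thing I'd like wmfdata to pass when it builds a YARN session. The numbers below are placeholders, not recommendations; the right values are exactly the input I'm asking for (see T245097):

    from pyspark.sql import SparkSession

    # Placeholder values only - what they should actually be for local SWAP,
    # regular YARN, and "large" YARN jobs is the open question in T245097.
    # (In client mode, spark.driver.memory may need to be set before the
    # driver JVM starts, e.g. via spark-defaults, rather than here.)
    spark = (
        SparkSession.builder
        .master("yarn")
        .appName("wmfdata")
        .config("spark.driver.memory", "2g")            # is 8g reasonable on a 64 GiB SWAP host?
        .config("spark.executor.memory", "4g")          # rather than the ~400 MiB executors we get now
        .config("spark.executor.cores", "2")
        .config("spark.dynamicAllocation.maxExecutors", "64")
        .config("spark.sql.shuffle.partitions", "200")  # worth raising for big YARN jobs?
        .enableHiveSupport()
        .getOrCreate()
    )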
On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn <[email protected]> wrote:

> I ran into this problem again, and I found that neither session.stop nor
> newSession got rid of the error. So it's still not clear how to recover
> from a crashed(?) Spark session.
>
> On the other hand, I did figure out why my sessions were crashing in the
> first place, so hopefully recovering from that will be a rare need. The
> reason is that wmfdata doesn't modify the default Spark settings
> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60>
> when it starts a new session, which was (for example) causing it to start
> executors with only ~400 MiB of memory each.
>
> I'm definitely going to change that, but it's not completely clear what
> the recommended settings for our cluster are. I cataloged the different
> recommendations at https://phabricator.wikimedia.org/T245097, and it would
> be very helpful if one of y'all could give some clear recommendations
> about what the settings should be for local SWAP, YARN, and "large" YARN
> jobs. For example, is it important to increase spark.sql.shuffle.partitions
> for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local
> job when the SWAP servers only have 64 GiB total?
>
> Thank you!
>
> On Fri, 7 Feb 2020 at 06:53, Andrew Otto <[email protected]> wrote:
>
>> Hm, interesting! I don't think many of us have used
>> SparkSession.builder.getOrCreate repeatedly in the same process. What
>> happens if you manually stop the Spark session first (session.stop()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>),
>> or maybe try to explicitly create a new session via newSession()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>?
>>
>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn <[email protected]> wrote:
>>
>>> Hi Luca!
>>>
>>> Those were separate YARN jobs I started later. When I got this error, I
>>> found that the YARN job corresponding to the SparkContext was marked as
>>> "successful", but I still couldn't get SparkSession.builder.getOrCreate
>>> to open a new one.
>>>
>>> Any idea what might have caused that or how I could recover without
>>> restarting the notebook, which could mean losing a lot of in-progress
>>> work? I had already restarted that kernel, so I don't know if I'll
>>> encounter this problem again. If I do, I'll file a task.
>>>
>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano <[email protected]> wrote:
>>>
>>>> Hey Neil,
>>>>
>>>> there were two YARN jobs running related to your notebooks; I just
>>>> killed them. Let's see if that solves the problem (you might need to
>>>> restart your notebook again). If not, let's open a task and
>>>> investigate :)
>>>>
>>>> Luca
>>>>
>>>> On Thu, 6 Feb 2020 at 02:08, Neil Shah-Quinn <[email protected]> wrote:
>>>>
>>>>> Whoa - I just got the same stopped SparkContext error on the query
>>>>> even after restarting the notebook, without an intermediate Java heap
>>>>> space error. That seems very strange to me.
>>>>>
>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn <[email protected]> wrote:
>>>>>
>>>>>> Hey there!
>>>>>>
>>>>>> I was running SQL queries via PySpark (using the wmfdata package
>>>>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>)
>>>>>> on SWAP when one of my queries failed with
>>>>>> "java.lang.OutOfMemoryError: Java heap space".
>>>>>>
>>>>>> After that, when I tried to call the spark.sql function again (via
>>>>>> wmfdata.hive.run), it failed with "java.lang.IllegalStateException:
>>>>>> Cannot call methods on a stopped SparkContext."
>>>>>>
>>>>>> When I tried to create a new Spark context using
>>>>>> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
>>>>>> or directly), it returned a SparkSession object properly, but calling
>>>>>> the object's sql function still gave the "stopped SparkContext" error.
>>>>>>
>>>>>> Any idea what's going on? I assume restarting the notebook kernel
>>>>>> would take care of the problem, but it seems like there has to be a
>>>>>> better way to recover.
>>>>>>
>>>>>> Thank you!
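And on the recovery question from the thread below: my understanding is that newSession() hands back a session that shares the same underlying SparkContext, which would explain why it didn't help once that context was stopped. For the record, this is roughly the sequence I tried, per Andrew's suggestion (a sketch from memory, not an exact repro):

    from pyspark.sql import SparkSession

    # getOrCreate() handed back the existing session, whose SparkContext
    # had already been stopped after the OutOfMemoryError.
    old = SparkSession.builder.getOrCreate()
    old.stop()  # stop it explicitly

    # Hoped this would build a fresh session/context, but spark.sql() still
    # raised "Cannot call methods on a stopped SparkContext".
    spark = SparkSession.builder.getOrCreate()
    spark.sql("SELECT 1").show()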
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
