[ 
https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rafal Wojdyla updated SPARK-38438:
----------------------------------
    Description: 
Reproduction:

{code:python}
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# later on we want to update spark.jars.packages, e.g. to add spark-hats
s = (SparkSession.builder
     .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
     .getOrCreate())

# the line below returns None; the config was not propagated:
s._sc._conf.get("spark.jars.packages")
{code}
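
For contrast, when there is no pre-existing global context, the same builder
call works as expected. A minimal sketch, assuming a fresh Python process in
which no session has been created yet:

{code:python}
from pyspark.sql import SparkSession

# Fresh interpreter, no prior context: the builder launches a new JVM and
# spark.jars.packages is honored at start-up.
s = (SparkSession.builder
     .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
     .getOrCreate())

# returns 'za.co.absa:spark-hats_2.12:0.2.2', and the package is downloaded
# and added to the classpath:
s._sc._conf.get("spark.jars.packages")
{code}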

Stopping the context doesn't help; in fact it's even more confusing, because
the configuration is updated but has no effect:

{code:python}
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

s.stop()

s = (SparkSession.builder
     .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
     .getOrCreate())

# now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the new
# context doesn't download the package, as it would if there were no prior
# global context. The extra package is therefore unusable: it is neither
# downloaded nor added to the classpath.
s._sc._conf.get("spark.jars.packages")
{code}
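
Presumably spark.jars.packages is only resolved when the JVM is first
launched, and stopping the SparkContext does not stop the JVM gateway, so the
rebuilt context silently reuses the old classpath. One way to see that the
package really is missing is to list the jars registered with the JVM context
(a debugging sketch via the internal py4j handles, which are private API and
may differ between versions):

{code:python}
# Sketch: s._sc._jsc is the py4j JavaSparkContext, .sc() the JVM
# SparkContext, and listJars() its list of registered jars.
# The spark-hats jar does not appear, even though the conf entry is set:
print(s._sc._jsc.sc().listJars())
{code}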

One workaround is to stop the context AND shut down the JVM gateway, which
amounts to a hard reset:

{code:python}
from pyspark import SparkContext
from pyspark.sql import SparkSession

# default session:
s = SparkSession.builder.getOrCreate()

# Hard reset: stop the session, then tear down the py4j gateway so the
# next getOrCreate() launches a fresh JVM:
s.stop()
s._sc._gateway.shutdown()
SparkContext._gateway = None
SparkContext._jvm = None

s = (SparkSession.builder
     .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
     .getOrCreate())

# Now we are guaranteed a brand new Spark session, and the packages
# are downloaded and added to the classpath.
{code}
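
If the reset is needed repeatedly, the steps above can be wrapped in a small
helper. This is a hypothetical convenience wrapper around the exact sequence
from the workaround, relying on the same private attributes:

{code:python}
from pyspark import SparkContext
from pyspark.sql import SparkSession

def hard_reset_spark(session: SparkSession) -> None:
    """Hypothetical helper: stop the session and tear down the py4j
    gateway, so the next builder call launches a fresh JVM and
    re-resolves spark.jars.packages."""
    session.stop()
    session._sc._gateway.shutdown()
    SparkContext._gateway = None
    SparkContext._jvm = None
{code}

After hard_reset_spark(s), a subsequent SparkSession.builder
.config(...).getOrCreate() behaves as in a fresh process.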

  was: (same reproduction as above; the previous revision additionally
imported the unused SparkConf in each snippet and was missing the
SparkContext import used in the hard-reset workaround)


> Can't update spark.jars.packages on existing global/default context
> -------------------------------------------------------------------
>
>                 Key: SPARK-38438
>                 URL: https://issues.apache.org/jira/browse/SPARK-38438
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core
>    Affects Versions: 3.2.1
>         Environment: py: 3.9
> spark: 3.2.1
>            Reporter: Rafal Wojdyla
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
