[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

Stavros Kontopoulos (JIRA) Thu, 17 Aug 2017 08:12:15 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525
 ]


Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:11 PM:
----------------------------------------------------------------------

[~jsnowacki] I don't think I am doing anything wrong. I followed your 
instructions. I use the pyspark that comes with the Spark distro, so there is 
no need to install it on my system.

So when I do:
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_DRIVER_PYTHON=jupyter
and then run ./pyspark, I have a fully working Jupyter notebook.
Also, when I type spark in a cell, a Spark session is already defined, and sc 
is defined as well:
SparkSession - in-memory
SparkContext
Spark UI
Version: v2.3.0-SNAPSHOT
Master: local[*]
AppName: PySparkShell

So it's not the case that you need to set up the Spark session on your own, 
unless things are set up in some other way I am not familiar with (likely).
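
One quick way to confirm this from a notebook cell is to look for the predefined names before building anything. The sketch below is only an illustration: the helper name predefined_spark_names is hypothetical and not part of PySpark, and it only inspects a namespace dictionary such as globals().

```python
def predefined_spark_names(namespace):
    """Return which of the shell-provided names ('spark', 'sc') already exist."""
    return [name for name in ("spark", "sc") if name in namespace]


# In a notebook started through ./pyspark one would call:
#     predefined_spark_names(globals())
# Here plain dicts stand in for the notebook namespace.
shell_like = {"spark": object(), "sc": object()}
bare_python = {}
print(predefined_spark_names(shell_like))
print(predefined_spark_names(bare_python))
```

If the returned list is non-empty, a session already exists and getOrCreate() will reuse it rather than start a new one.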

Then I ran your example, but --packages has no effect.

{code:java}
import os
import pyspark

# Only read when the JVM gateway is launched.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

# getOrCreate() returns the session that already exists in the notebook, if any.
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()

people = spark.createDataFrame(
    [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178),
     ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82),
     ("Bombur", None)],
    ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
{code}
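
The value assigned to PYSPARK_SUBMIT_ARGS in the example has a fixed shape: the --packages flag, a comma-separated list of Maven coordinates, and the trailing pyspark-shell token. A small helper can assemble it; build_submit_args is a hypothetical name used for illustration, not a PySpark function. Setting the variable can only matter if it happens before the JVM gateway is launched.

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value from a list of Maven coordinates."""
    if not packages:
        # Nothing to resolve: keep only the shell token.
        return "pyspark-shell"
    return "--packages {} pyspark-shell".format(",".join(packages))


args = build_submit_args(["org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"])
os.environ["PYSPARK_SUBMIT_ARGS"] = args
print(args)
```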

Check here:
https://github.com/jupyter/notebook/issues/743
https://gist.github.com/ololobus/4c221a0891775eaa86b0
for some ways to start things. 

Now, I suspect this is the responsible line: 
https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54
It is where PYSPARK_SUBMIT_ARGS is taken into consideration, but as I said, 
from what I observed the Java gateway is launched only once, when my Python 
notebook is started. You can easily check that by modifying the file to print 
something, and also by checking whether spark is already defined, as in my 
case. I searched for the places where this variable is used and found nothing 
related to SparkConf, unless you somehow use spark-submit (which pyspark calls, 
by the way).
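
The launch-once behaviour described here can be illustrated with a toy model. It is not PySpark's code: ToyGateway and ToyContext are hypothetical stand-ins, and the only assumptions carried over are that the submit arguments are read from the environment when the gateway is first launched (defaulting to pyspark-shell) and that later calls reuse the cached gateway.

```python
class ToyGateway:
    """Stand-in for the JVM gateway; captures the submit args at launch time."""

    def __init__(self, environ):
        self.submit_args = environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")


class ToyContext:
    """Stand-in for a context that caches a single gateway per process."""

    _gateway = None

    @classmethod
    def ensure_initialized(cls, environ):
        # Launch only if no gateway exists yet; otherwise reuse the cached one.
        if cls._gateway is None:
            cls._gateway = ToyGateway(environ)
        return cls._gateway


env = {}  # plays the role of os.environ when the notebook starts
first = ToyContext.ensure_initialized(env)

# Setting the variable afterwards, as a notebook cell would do.
# The coordinate below is a placeholder, not a real package.
env["PYSPARK_SUBMIT_ARGS"] = "--packages some.group:artifact:1.0 pyspark-shell"
second = ToyContext.ensure_initialized(env)

print(second is first)
print(second.submit_args)
```

In this model the second call returns the same gateway, so the later --packages value never reaches it, which matches the observation that the option has no effect once the notebook is running.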





> Config spark.jars.packages is ignored in SparkSession config
> ------------------------------------------------------------
>
>                 Key: SPARK-21752
>                 URL: https://issues.apache.org/jira/browse/SPARK-21752
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Jakub Nowacki
>
> If I set the config key {{spark.jars.packages}} using the {{SparkSession}} 
> builder as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
>     .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
>     .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
>     .getOrCreate()
> {code}
> the SparkSession gets created, but no package download logs are printed, 
> and if I use the loaded classes (the Mongo connector in this case, but it's 
> the same for other packages), I get {{java.lang.ClassNotFoundException}} for 
> the missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line 
> option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using a {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config(conf=conf)\
>     .getOrCreate()
> {code}
> The above is in Python, but I've seen the same behavior in other languages, 
> though I didn't check R. 
> I have also seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this applies to creating a new {{SparkSession}}, as loading new 
> packages into an existing {{SparkSession}} indeed doesn't make sense. Thus 
> this will only work with bare Python, Scala or Java, and not with {{pyspark}} 
> or {{spark-shell}}, as they create the session automatically; in this case one 
> would need to use the {{--packages}} option. 


