[
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133565#comment-16133565
]
Stavros Kontopoulos edited comment on SPARK-21752 at 8/18/17 8:19 PM:
----------------------------------------------------------------------
[~jsnowacki] What I am doing is not manual; it's just another legitimate way to
start Jupyter. In fact it's far from manual, since it works out of the box. But
anyway, the point here is the consistency of Spark's config API (since it's a
public API). I agree that the other way to start things is more Pythonic, and
that way is quite manual IMHO, but it's fine since it's common practice (that
is why I insisted on all the details).
For the tests below I used `pip install pyspark` and then `jupyter notebook`
(I could have just used plain Python), followed your example, and got the
following (some results confirm what you have already observed):
a) Setting the environment variable always works, regardless of any other
configuration:
{code:python}
import pyspark
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
)

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .master('local[*]') \
    .getOrCreate()
{code}
{noformat}
[I 22:50:58.697 NotebookApp] Adapting to protocol v5.1 for kernel d05897ed-6de4-4ec2-842f-adb094bf0f0d
Ivy Default Cache set to: /home/stavros/.ivy2/cache
The jars for the packages stored in: /home/stavros/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/lib/python3.5/dist-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 in central
    found org.mongodb#mongo-java-driver;3.4.2 in central
:: resolution report :: resolve 160ms :: artifacts dl 3ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.4.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/5ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/18 22:52:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/18 22:52:05 WARN Utils: Your hostname, universe resolves to a loopback address: 127.0.1.1; using 192.168.2.7 instead (on interface wlp2s0)
17/08/18 22:52:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
{noformat}
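My understanding of why (a) always works (this is an assumption about the
internals, simplified): PySpark's launcher reads {{PYSPARK_SUBMIT_ARGS}} from
the environment and tokenizes it into the argument list for {{spark-submit}}
before the JVM even starts, so {{--packages}} always reaches the Ivy
resolution step. Roughly:
{code:python}
import os
import shlex

# Set the submit args exactly as in example (a).
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
)

# Sketch of the launcher side (simplified assumption, not the real code):
# the variable is tokenized shell-style into spark-submit arguments,
# so --packages is visible before the JVM and Ivy start.
args = shlex.split(os.environ['PYSPARK_SUBMIT_ARGS'])
{code}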
b) Example 1 works for me without issues:
{code:python}
import pyspark

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .master('local[*]') \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()
{code}
Output:
{noformat}
Creating new notebook in
[I 23:03:52.055 NotebookApp] Kernel started: bc93a17a-e7a5-4e83-8a63-df0adba97c79
[W 23:03:52.058 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20170818230343 (127.0.0.1) 1.46ms referer=http://localhost:8888/notebooks/Untitled2.ipynb?kernel_name=python3
[I 23:04:21.361 NotebookApp] Adapting to protocol v5.1 for kernel bc93a17a-e7a5-4e83-8a63-df0adba97c79
Ivy Default Cache set to: /home/stavros/.ivy2/cache
The jars for the packages stored in: /home/stavros/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/lib/python3.5/dist-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 in central
    found org.mongodb#mongo-java-driver;3.4.2 in central
:: resolution report :: resolve 169ms :: artifacts dl 4ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.4.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/5ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/18 23:04:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/18 23:04:22 WARN Utils: Your hostname, universe resolves to a loopback address: 127.0.1.1; using 192.168.2.7 instead (on interface wlp2s0)
17/08/18 23:04:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
{noformat}
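That examples 1 and 2 agree is what I would expect: as far as I can tell, the
builder's {{config(key, value)}} calls and {{config(conf=conf)}} both fold
into the same option map, which is only handed to the JVM when
{{getOrCreate()}} runs. A toy sketch of that folding ({{MiniBuilder}} is my
own hypothetical class, not PySpark's real builder):
{code:python}
# Toy model of the builder's option handling (hypothetical; PySpark's real
# builder stores options similarly but talks to a JVM-side SparkConf).
class MiniBuilder:
    def __init__(self):
        self._options = {}

    def config(self, key=None, value=None, conf=None):
        if conf is not None:
            # SparkConf route (example 2): copy every (key, value) pair.
            for k, v in conf.items():
                self._options[k] = v
        else:
            # Plain key/value route (example 1).
            self._options[key] = value
        return self

builder = MiniBuilder() \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
    .config(conf={"spark.mongodb.input.uri": "mongodb://mongo/test.coll"})
{code}
Both routes end up in the same place, so by the time the session is built
there should be no difference between them.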
c) Example 2 works as expected:
{code:python}
import pyspark

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .master('local[*]') \
    .config(conf=conf) \
    .getOrCreate()
{code}
{noformat}
[I 23:07:13.494 NotebookApp] Creating new notebook in
[I 23:07:13.836 NotebookApp] Kernel started: c61a540b-86a2-4b9e-927f-66f977b42b0f
[W 23:07:13.840 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20170818230708 (127.0.0.1) 2.09ms referer=http://localhost:8888/notebooks/Untitled3.ipynb?kernel_name=python3
[I 23:07:14.353 NotebookApp] Adapting to protocol v5.1 for kernel c61a540b-86a2-4b9e-927f-66f977b42b0f
[W 23:07:16.413 NotebookApp] Replacing stale connection: bc93a17a-e7a5-4e83-8a63-df0adba97c79:D7862E2BCD2F4111883A382D5EA7714D
Ivy Default Cache set to: /home/stavros/.ivy2/cache
The jars for the packages stored in: /home/stavros/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/lib/python3.5/dist-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 in central
    found org.mongodb#mongo-java-driver;3.4.2 in central
:: resolution report :: resolve 160ms :: artifacts dl 3ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.4.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/4ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/18 23:07:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/18 23:07:18 WARN Utils: Your hostname, universe resolves to a loopback address: 127.0.1.1; using 192.168.2.7 instead (on interface wlp2s0)
17/08/18 23:07:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
{noformat}
In all cases I started with a plain Python kernel and a fresh notebook.
The Java gateway is always launched when the SparkSession object is built, and
it passes the config options on to the spark-submit logic; at least this is
what I observed.
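To make that last observation concrete, here is a toy model of what I mean
(my own simplification and naming, not the actual code path in
{{java_gateway.py}}): building the session triggers the gateway launch, which
assembles a {{spark-submit}} invocation from the environment plus any pending
options.
{code:python}
# Hypothetical, simplified launcher: fold PYSPARK_SUBMIT_ARGS and pending
# builder options into one spark-submit command line.
def launch_gateway(env, pending_options):
    submit_args = env.get('PYSPARK_SUBMIT_ARGS', 'pyspark-shell')
    conf_flags = []
    for key, value in sorted(pending_options.items()):
        conf_flags += ['--conf', '%s=%s' % (key, value)]
    return ['spark-submit'] + conf_flags + submit_args.split()

cmd = launch_gateway(
    {'PYSPARK_SUBMIT_ARGS':
         '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'},
    {'spark.mongodb.input.uri': 'mongodb://mongo/test.coll'},
)
{code}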
> Config spark.jars.packages is ignored in SparkSession config
> ------------------------------------------------------------
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder \
>     .appName('test-mongo') \
>     .master('local[*]') \
>     .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
>     .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
>     .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
>     .getOrCreate()
> {code}
> the SparkSession gets created, but no package download logs are printed, and
> if I then use the supposedly loaded classes (the Mongo connector in this
> case, but it is the same for other packages) I get
> {{java.lang.ClassNotFoundException}} for the missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line
> option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using a {{SparkConf}} object works fine as
> well, e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder \
>     .appName('test-mongo') \
>     .master('local[*]') \
>     .config(conf=conf) \
>     .getOrCreate()
> {code}
> The above is in Python, but I have seen the behavior in other languages too,
> though I did not check R.
> I have also seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the
> {{SparkSession}} builder config.
> Note that this is related to creating a new {{SparkSession}}, as pulling new
> packages into an existing {{SparkSession}} indeed does not make sense. Thus
> this will only work with bare Python, Scala or Java, and not in {{pyspark}}
> or {{spark-shell}}, as they create the session automatically; in this case
> one would need to use the {{--packages}} option.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)