[
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133565#comment-16133565
]
Stavros Kontopoulos edited comment on SPARK-21752 at 8/18/17 8:19 PM:
----------------------------------------------------------------------
[~jsnowacki] What I am doing is not manual; it's just another legitimate way to
start Jupyter. In fact it's far from manual, since it works out of the box. But
anyway, the point here is the consistency of Spark's config API (since it's a
public API). I agree that the other way to start things is more Pythonic, and
that way is quite manual IMHO, but it's fine since it's common practice (that
is why I insisted on all the details).
For the tests below I used `pip install pyspark` and then `jupyter notebook`
(I could have just used plain Python), followed your example, and got the
following (some results confirm what you have already observed):
a) Setting the environment variable always works, regardless of any other
configuration:
{code:python}
import pyspark
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
)

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .master('local[*]') \
    .getOrCreate()
{code}
{noformat}
[I 22:50:58.697 NotebookApp] Adapting to protocol v5.1 for kernel d05897ed-6de4-4ec2-842f-adb094bf0f0d
Ivy Default Cache set to: /home/stavros/.ivy2/cache
The jars for the packages stored in: /home/stavros/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/lib/python3.5/dist-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 in central
    found org.mongodb#mongo-java-driver;3.4.2 in central
:: resolution report :: resolve 160ms :: artifacts dl 3ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.4.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/5ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/18 22:52:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/18 22:52:05 WARN Utils: Your hostname, universe resolves to a loopback address: 127.0.1.1; using 192.168.2.7 instead (on interface wlp2s0)
17/08/18 22:52:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
{noformat}
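My understanding of why (a) always works (this is an assumption about the
internals, simplified): PySpark's launcher reads {{PYSPARK_SUBMIT_ARGS}} from
the environment and tokenizes it into the argument list for {{spark-submit}}
before the JVM even starts, so {{--packages}} always reaches the Ivy
resolution step. Roughly:
{code:python}
import os
import shlex

# Set the submit args exactly as in example (a).
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
)

# Sketch of the launcher side (simplified assumption, not the real code):
# the variable is tokenized shell-style into spark-submit arguments,
# so --packages is visible before the JVM and Ivy start.
args = shlex.split(os.environ['PYSPARK_SUBMIT_ARGS'])
{code}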
b) Example 1 works for me without issues:
{code:python}
import pyspark

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .master('local[*]') \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()
{code}
Output:
{noformat}
Creating new notebook in
[I 23:03:52.055 NotebookApp] Kernel started: bc93a17a-e7a5-4e83-8a63-df0adba97c79
[W 23:03:52.058 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20170818230343 (127.0.0.1) 1.46ms referer=http://localhost:8888/notebooks/Untitled2.ipynb?kernel_name=python3
[I 23:04:21.361 NotebookApp] Adapting to protocol v5.1 for kernel bc93a17a-e7a5-4e83-8a63-df0adba97c79
Ivy Default Cache set to: /home/stavros/.ivy2/cache
The jars for the packages stored in: /home/stavros/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/lib/python3.5/dist-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 in central
    found org.mongodb#mongo-java-driver;3.4.2 in central
:: resolution report :: resolve 169ms :: artifacts dl 4ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.4.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/5ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/18 23:04:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/18 23:04:22 WARN Utils: Your hostname, universe resolves to a loopback address: 127.0.1.1; using 192.168.2.7 instead (on interface wlp2s0)
17/08/18 23:04:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
{noformat}
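That examples 1 and 2 agree is what I would expect: as far as I can tell, the
builder's {{config(key, value)}} calls and {{config(conf=conf)}} both fold
into the same option map, which is only handed to the JVM when
{{getOrCreate()}} runs. A toy sketch of that folding ({{MiniBuilder}} is my
own hypothetical class, not PySpark's real builder):
{code:python}
# Toy model of the builder's option handling (hypothetical; PySpark's real
# builder stores options similarly but talks to a JVM-side SparkConf).
class MiniBuilder:
    def __init__(self):
        self._options = {}

    def config(self, key=None, value=None, conf=None):
        if conf is not None:
            # SparkConf route (example 2): copy every (key, value) pair.
            for k, v in conf.items():
                self._options[k] = v
        else:
            # Plain key/value route (example 1).
            self._options[key] = value
        return self

builder = MiniBuilder() \
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
    .config(conf={"spark.mongodb.input.uri": "mongodb://mongo/test.coll"})
{code}
Both routes end up in the same place, so by the time the session is built
there should be no difference between them.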
c) Example 2 works as expected:
{code:python}
import pyspark

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .master('local[*]') \
    .config(conf=conf) \
    .getOrCreate()
{code}
{noformat}
[I 23:07:13.494 NotebookApp] Creating new notebook in
[I 23:07:13.836 NotebookApp] Kernel started: c61a540b-86a2-4b9e-927f-66f977b42b0f
[W 23:07:13.840 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20170818230708 (127.0.0.1) 2.09ms referer=http://localhost:8888/notebooks/Untitled3.ipynb?kernel_name=python3
[I 23:07:14.353 NotebookApp] Adapting to protocol v5.1 for kernel c61a540b-86a2-4b9e-927f-66f977b42b0f
[W 23:07:16.413 NotebookApp] Replacing stale connection: bc93a17a-e7a5-4e83-8a63-df0adba97c79:D7862E2BCD2F4111883A382D5EA7714D
Ivy Default Cache set to: /home/stavros/.ivy2/cache
The jars for the packages stored in: /home/stavros/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/lib/python3.5/dist-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 in central
    found org.mongodb#mongo-java-driver;3.4.2 in central
:: resolution report :: resolve 160ms :: artifacts dl 3ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.4.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.2.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/4ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/18 23:07:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/18 23:07:18 WARN Utils: Your hostname, universe resolves to a loopback address: 127.0.1.1; using 192.168.2.7 instead (on interface wlp2s0)
17/08/18 23:07:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
{noformat}
In all cases I started with a plain Python kernel and a fresh notebook.
The Java gateway is always launched when the SparkSession object is built, and
it passes the config options on to the spark-submit logic; at least this is
what I observed.
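To make that last observation concrete, here is a toy model of what I mean
(my own simplification and naming, not the actual code path in
{{java_gateway.py}}): building the session triggers the gateway launch, which
assembles a {{spark-submit}} invocation from the environment plus any pending
options.
{code:python}
# Hypothetical, simplified launcher: fold PYSPARK_SUBMIT_ARGS and pending
# builder options into one spark-submit command line.
def launch_gateway(env, pending_options):
    submit_args = env.get('PYSPARK_SUBMIT_ARGS', 'pyspark-shell')
    conf_flags = []
    for key, value in sorted(pending_options.items()):
        conf_flags += ['--conf', '%s=%s' % (key, value)]
    return ['spark-submit'] + conf_flags + submit_args.split()

cmd = launch_gateway(
    {'PYSPARK_SUBMIT_ARGS':
         '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'},
    {'spark.mongodb.input.uri': 'mongodb://mongo/test.coll'},
)
{code}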
> Config spark.jars.packages is ignored in SparkSession config
> ------------------------------------------------------------
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder \
>     .appName('test-mongo') \
>     .master('local[*]') \
>     .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
>     .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
>     .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
>     .getOrCreate()
> {code}
> the SparkSession gets created, but no package download logs are printed, and
> if I then use the supposedly loaded classes (the Mongo connector in this
> case, but it is the same for other packages) I get
> {{java.lang.ClassNotFoundException}} for the missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line
> option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using a {{SparkConf}} object works fine as
> well, e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder \
>     .appName('test-mongo') \
>     .master('local[*]') \
>     .config(conf=conf) \
>     .getOrCreate()
> {code}
> The above is in Python, but I have seen the behavior in other languages too,
> though I did not check R.
> I have also seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the
> {{SparkSession}} builder config.
> Note that this is related to creating a new {{SparkSession}}, as pulling new
> packages into an existing {{SparkSession}} indeed does not make sense. Thus
> this will only work with bare Python, Scala or Java, and not in {{pyspark}}
> or {{spark-shell}}, as they create the session automatically; in this case
> one would need to use the {{--packages}} option.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)