Barry created SPARK-28517:
-----------------------------
Summary: pyspark with --conf spark.jars.packages causes duplicate
jars to be uploaded
Key: SPARK-28517
URL: https://issues.apache.org/jira/browse/SPARK-28517
Project: Spark
Issue Type: Bug
Components: PySpark, YARN
Affects Versions: 2.4.3
Environment: spark 2.4.3_2.12 without hadoop
yarn 2.6
python 2.7.16
centos 7
Reporter: Barry
h2. Steps to reproduce:
{{spark-submit --master yarn --conf
"spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3"
${SPARK_HOME}/examples/src/main/python/pi.py 100}}
h2. Undesirable behavior:
Warnings are printed stating that the package jars have been added to the
distributed cache multiple times:
{{19/07/25 23:25:07 WARN Client: Same path resource
file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added
multiple times to distributed cache.}}
{{19/07/25 23:25:07 WARN Client: Same path resource
file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added
multiple times to distributed cache.}}
This does not happen for Scala jobs, only for PySpark jobs.
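To check a captured spark-submit log for this warning, something like the following could be used. It is only a minimal sketch and not part of Spark: the function name {{find_duplicate_resources}}, the {{SAMPLE_LOG}} constant, and the regular expression are assumptions derived from the warning text quoted above, and the pattern tolerates the line wrapping seen in this report.

```python
import re

# Assumed pattern, based only on the WARN lines quoted in this report.
# \s+ is used between words so the match survives wrapped log lines.
DUPLICATE_WARNING = re.compile(
    r"WARN Client: Same path resource\s+(\S+)\s+added\s+"
    r"multiple times to distributed cache"
)


def find_duplicate_resources(log_text):
    """Return the resource URIs flagged as duplicates, in first-seen order."""
    seen = []
    for match in DUPLICATE_WARNING.finditer(log_text):
        uri = match.group(1)
        if uri not in seen:
            seen.append(uri)
    return seen


# Sample input copied from the warnings in this report (wrapped as received).
SAMPLE_LOG = """\
19/07/25 23:25:07 WARN Client: Same path resource
file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added
multiple times to distributed cache.
19/07/25 23:25:07 WARN Client: Same path resource
file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added
multiple times to distributed cache.
"""

for uri in find_duplicate_resources(SAMPLE_LOG):
    print(uri)
```

Run against the sample, this prints the two jar paths under {{/home/barryl/.ivy2/jars}}, one per line. A log without the warning yields an empty list.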
h2. Full output of example run:
{{[barryl@hostname ~]$ /opt/spark2/bin/spark-submit --master yarn --conf
"spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3"
/opt/spark2/examples/src/main/python/pi.py 100}}
{{Ivy Default Cache set to: /home/barryl/.ivy2/cache}}
{{The jars for the packages stored in: /home/barryl/.ivy2/jars}}
{{:: loading settings :: url =
jar:file:/opt/spark-2.4.3-bin-without-hadoop-scala-2.12/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml}}
{{org.apache.spark#spark-avro_2.12 added as a dependency}}
{{:: resolving dependencies ::
org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c;1.0}}
{{ confs: [default]}}
{{ found org.apache.spark#spark-avro_2.12;2.4.3 in central}}
{{ found org.spark-project.spark#unused;1.0.0 in central}}
{{:: resolution report :: resolve 457ms :: artifacts dl 5ms}}
{{ :: modules in use:}}
{{ org.apache.spark#spark-avro_2.12;2.4.3 from central in [default]}}
{{ org.spark-project.spark#unused;1.0.0 from central in [default]}}
{{ ---------------------------------------------------------------------}}
{{ |                  |            modules            ||   artifacts   |}}
{{ |       conf       | number| search|dwnlded|evicted|| number|dwnlded|}}
{{ ---------------------------------------------------------------------}}
{{ |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |}}
{{ ---------------------------------------------------------------------}}
{{:: retrieving ::
org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c}}
{{ confs: [default]}}
{{ 0 artifacts copied, 2 already retrieved (0kB/7ms)}}
{{19/07/25 23:25:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive
is set, falling back to uploading libraries under SPARK_HOME.}}
{{19/07/25 23:25:07 WARN Client: Same path resource
file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added
multiple times to distributed cache.}}
{{19/07/25 23:25:07 WARN Client: Same path resource
file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added
multiple times to distributed cache.}}
{{19/07/25 23:25:28 WARN TaskSetManager: Stage 0 contains a task of very large
size (365 KB). The maximum recommended task size is 100 KB.}}
{{Pi is roughly 3.142308}}