[ https://issues.apache.org/jira/browse/SPARK-28517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936201#comment-16936201 ]
holdenk commented on SPARK-28517:
---------------------------------

cc [~bryanc] / [~ifilonenko]

> pyspark with --conf spark.jars.packages causes duplicate jars to be uploaded
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-28517
>                 URL: https://issues.apache.org/jira/browse/SPARK-28517
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 2.4.3
>         Environment: spark 2.4.3_2.12 without hadoop
> yarn 2.6
> python 2.7.16
> centos 7
>            Reporter: Barry
>            Priority: Major
>              Labels: ivy, pyspark, yarn
>
> h2. Steps to reproduce:
> {{spark-submit --master yarn --conf "spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" ${SPARK_HOME}/examples/src/main/python/pi.py 100}}
> h2. Undesirable behavior:
> Warnings are printed saying that the package jars have been added to the distributed cache multiple times:
> {{19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added multiple times to distributed cache.}}
> {{19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.}}
> This does not happen for Scala jobs, only for PySpark.
>
> h2. Full output of example run.
> {{[barryl@hostname ~]$ /opt/spark2/bin/spark-submit --master yarn --conf "spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" /opt/spark2/examples/src/main/python/pi.py 100}}
> {{Ivy Default Cache set to: /home/barryl/.ivy2/cache}}
> {{The jars for the packages stored in: /home/barryl/.ivy2/jars}}
> {{:: loading settings :: url = jar:file:/opt/spark-2.4.3-bin-without-hadoop-scala-2.12/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml}}
> {{org.apache.spark#spark-avro_2.12 added as a dependency}}
> {{:: resolving dependencies :: org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c;1.0}}
> {{ confs: [default]}}
> {{ found org.apache.spark#spark-avro_2.12;2.4.3 in central}}
> {{ found org.spark-project.spark#unused;1.0.0 in central}}
> {{:: resolution report :: resolve 457ms :: artifacts dl 5ms}}
> {{ :: modules in use:}}
> {{ org.apache.spark#spark-avro_2.12;2.4.3 from central in [default]}}
> {{ org.spark-project.spark#unused;1.0.0 from central in [default]}}
> {{ ---------------------------------------------------------------------}}
> {{ |                  |            modules            ||   artifacts   |}}
> {{ |       conf       | number| search|dwnlded|evicted|| number|dwnlded|}}
> {{ ---------------------------------------------------------------------}}
> {{ |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |}}
> {{ ---------------------------------------------------------------------}}
> {{:: retrieving :: org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c}}
> {{ confs: [default]}}
> {{ 0 artifacts copied, 2 already retrieved (0kB/7ms)}}
> {{19/07/25 23:25:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.}}
> {{19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added multiple times to distributed cache.}}
> {{19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.}}
> {{19/07/25 23:25:28 WARN TaskSetManager: Stage 0 contains a task of very large size (365 KB). The maximum recommended task size is 100 KB.}}
> {{Pi is roughly 3.142308}}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
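For anyone checking whether their own jobs hit the same warning, here is a minimal sketch that scans captured spark-submit output for the "Same path resource ... added multiple times to distributed cache" lines quoted in the report and counts them per jar. The helper name `duplicate_cache_paths` and the regular expression are illustrative assumptions based only on the log lines above; they are not part of Spark, and the wording of the warning may differ in other Spark versions.

```python
import re
from collections import Counter

# Pattern derived from the WARN lines quoted in the report (an assumption:
# other Spark versions may word the message differently). Whitespace is
# matched loosely so that log lines re-wrapped onto several lines still match.
_DUPLICATE_WARNING = re.compile(
    r"WARN\s+Client:\s+Same\s+path\s+resource\s+(\S+)\s+"
    r"added\s+multiple\s+times\s+to\s+distributed\s+cache"
)


def duplicate_cache_paths(log_text):
    """Count duplicate-upload warnings per resource path in spark-submit output."""
    return Counter(_DUPLICATE_WARNING.findall(log_text))


# Sample input copied from the log lines in the report.
sample = """\
19/07/25 23:25:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added multiple times to distributed cache.
19/07/25 23:25:07 WARN Client: Same path resource file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.
"""

if __name__ == "__main__":
    for path, count in duplicate_cache_paths(sample).items():
        print(count, path)
```

Running the same function over the output of an equivalent Scala job should return an empty counter if the behavior described in the report holds.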