[ https://issues.apache.org/jira/browse/OOZIE-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376166#comment-15376166 ]
Jonathan Kelly commented on OOZIE-2606:
---------------------------------------
bq. Is it replacement for --files?
No, the spark.yarn.jar(s) properties are unrelated to --files. (Btw, --files is
analogous to spark.yarn.dist.files.)
bq. Is spark.yarn.jars as a replacement of spark.yarn.jar with some additional
functionality?
spark.yarn.jar is for Spark 1.x only and is deprecated as of Spark 2.0. Using
it in 2.0 currently causes a warning to be printed, though I think it still
works (by treating it as if you used spark.yarn.jars).
spark.yarn.jars is for Spark 2.x only. The property was renamed since Spark 2.x
does not have just a single Spark assembly jar but rather a collection of jars
in the $SPARK_HOME/jars directory. In order for Spark to be launchable from a
non-standard directory structure (e.g., from inside an Oozie YARN container),
you need to set spark.yarn.jars to the list of all jars that comprise the Spark
distribution.
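As a rough sketch (paths and class names below are hypothetical placeholders, not from the patch), launching Spark 2.x from a non-standard layout might look like:

```shell
# Hypothetical example: tell Spark 2.x explicitly where its jars live
# when they are not under the standard $SPARK_HOME/jars layout.
# "local:" means the path exists on every node, so nothing is uploaded.
spark-submit \
  --master yarn \
  --conf spark.yarn.jars="local:/opt/spark/jars/*.jar" \
  --class com.example.App \
  app.jar
```

Without spark.yarn.jars (or spark.yarn.archive), SparkSubmit falls back to zipping and uploading whatever it can find locally, which is exactly what breaks inside an Oozie container.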
The paths can be either local file paths or HDFS file paths. In the first patch
I provided, I set them to local file paths (just *.jar, resolved relative to
the container's current working directory), but
[~satishsaley] made me realize that it would probably be better to reference
the files directly from the sharelib in HDFS. That way when each SparkAction
runs, SparkSubmit won't need to upload the jars to HDFS again (they're already
there in the sharelib) in order to add them to the DistributedCache, which will
be used when launching the Spark YARN containers.
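Referencing the sharelib copies directly might look like this in the action's spark-opts (the namenode address and sharelib timestamp directory below are hypothetical examples; actual paths vary by installation):

```shell
# Hypothetical example: point spark.yarn.jars at the jars already in the
# Oozie sharelib on HDFS, so SparkSubmit registers them in the
# DistributedCache without re-uploading them for every SparkAction.
--conf spark.yarn.jars=hdfs://namenode:8020/user/oozie/share/lib/lib_20160701/spark/*.jar
```

Because the jars are already on HDFS, YARN can localize them straight from the sharelib, which avoids a per-action upload of the full Spark distribution.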
> Set spark.yarn.jars to fix Spark 2.0 with Oozie
> -----------------------------------------------
>
> Key: OOZIE-2606
> URL: https://issues.apache.org/jira/browse/OOZIE-2606
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: 4.2.0
> Reporter: Jonathan Kelly
> Labels: spark, spark2.0.0
> Fix For: trunk
>
> Attachments: OOZIE-2606.patch
>
>
> Oozie adds all of the jars in the Oozie Spark sharelib to the
> DistributedCache such that all jars will be present in the current working
> directory of the YARN container (as well as in the container classpath).
> However, this is not quite enough to make Spark 2.0 work, since Spark 2.0 by
> default looks for the jars in assembly/target/scala-2.11/jars [1] (as if it
> is a locally built distribution for development) and will not find them in
> the current working directory.
> To fix this, we can set spark.yarn.jars to *.jar so that it finds the jars in
> the current working directory rather than looking in the wrong place. [2]
> [1]
> https://github.com/apache/spark/blob/v2.0.0-rc2/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java#L357
> [2]
> https://github.com/apache/spark/blob/v2.0.0-rc2/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L476
> Note: This property will be ignored by Spark 1.x.
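To make the quoted fix concrete (a minimal sketch of the idea in the attached patch, not its exact code): since the sharelib jars are localized into the container's working directory, a relative glob is enough.

```shell
# Hypothetical sketch of the approach: a bare glob resolves against the
# YARN container's current working directory, where the DistributedCache
# has already placed the Spark sharelib jars.
--conf spark.yarn.jars=*.jar
```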
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)