[
https://issues.apache.org/jira/browse/OOZIE-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Satish Subhashrao Saley updated OOZIE-2606:
-------------------------------------------
Attachment: OOZIE-2606-3.patch
{quote}
1) Please use a separate method than fixFsDefaultUris
{quote}
Added a new method to set {{spark.yarn.jar}} and {{spark.yarn.jars}}, but it
needed another iteration over the distributed files.
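The extra pass could look roughly like the sketch below. This is a minimal illustration, not the actual patch: the class name, method name, and regex are assumptions, and the cached files are represented as plain name strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SparkYarnJarsSetter {
    // Hypothetical pattern for spotting the spark-yarn jar among the
    // distributed-cache files; the real pattern in the patch may differ.
    private static final Pattern SPARK_YARN_JAR_PATTERN =
            Pattern.compile("spark-yarn.*\\.jar$");

    // Walks the cache file names once more and appends a
    // --conf spark.yarn.jars entry for the spark-yarn jar it finds.
    static List<String> appendSparkYarnJarsConf(List<String> sparkArgs,
            List<String> cacheFileNames) {
        List<String> result = new ArrayList<>(sparkArgs);
        for (String name : cacheFileNames) {
            if (SPARK_YARN_JAR_PATTERN.matcher(name).find()) {
                result.add("--conf");
                result.add("spark.yarn.jars=" + name);
            }
        }
        return result;
    }
}
```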
{quote}
2) Please use separate if blocks for pattern matching and getting the file
SPARK_YARN_JAR_PATTERN.matcher(p.getName()).find() ||
SPARK_ASSEMBLY_JAR_PATTERN.matcher(p.getName()).find()
{quote}
Added.
{quote}
3) Hadoop has APIs to get version-
org.apache.hadoop.util.VersionInfo.getVersion(). Check if spark has something
similar and that can be used instead of looking at manifest directly
{quote}
We can get it through {{SparkContext}} or the API Peter pointed out, but I am
reading it from the jar manifest for simplicity.
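Reading the version from the manifest can be sketched as below. The attribute name and the helper's name are assumptions for illustration, not necessarily what the patch uses.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class SparkVersionFromManifest {
    // Opens the Spark jar and reads a version attribute from its main
    // manifest; returns null when the jar has no manifest.
    static String getSparkVersion(File sparkJar) throws IOException {
        try (JarFile jar = new JarFile(sparkJar)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            // Assumption: the version sits in Specification-Version;
            // Implementation-Version is a common alternative.
            return manifest.getMainAttributes().getValue("Specification-Version");
        }
    }
}
```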
{quote}
4) Skip adding --conf spark.yarn.jars if it is version 1.x.
{quote}
Done
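The skip amounts to a small guard on the version string read above; the method name here is hypothetical.

```java
public class SparkYarnJarsGuard {
    // spark.yarn.jars only exists in Spark 2.x and later, so skip setting
    // it when the detected version is 1.x (or unknown).
    static boolean shouldSetSparkYarnJars(String sparkVersion) {
        return sparkVersion != null && !sparkVersion.startsWith("1.");
    }
}
```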
{quote}
5) Currently all jars are in --files and spark.yarn.jars which will be
confusing for user and will also generate lot of log messages saying duplicate
jar. Will it work if you just put spark-yarn*.jar in spark.yarn.jars and rest
in --files?
{quote}
Checked; it works. Now setting {{spark.yarn.jars}} to the spark-yarn*.jar only.
{quote}
Another thing is that you need to skip setting spark.yarn.jars if
spark.yarn.archive is set by the user or is present in spark-defaults.conf
{quote}
{{spark.yarn.archive}} will take precedence over {{spark.yarn.jars}}, as
mentioned in the [spark
documentation|http://spark.apache.org/docs/latest/running-on-yarn.html]
{code}
An archive containing needed Spark jars for distribution to the YARN cache. If
set, this configuration replaces spark.yarn.jars and the archive is used in all
the application's containers. The archive should contain jar files in its root
directory. Like with the previous option, the archive can also be hosted on
HDFS to speed up file distribution.
{code}
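Even so, the requested check on the Oozie side can be sketched as below, assuming the user's configuration and spark-defaults.conf are merged into a single {{Properties}} object; the names are assumptions, not the actual patch.

```java
import java.util.Properties;

public class SparkYarnArchiveGuard {
    // Returns true when the user (or spark-defaults.conf) already provides
    // spark.yarn.archive, in which case spark.yarn.jars should not be set.
    static boolean userProvidedArchive(Properties sparkConf) {
        String archive = sparkConf.getProperty("spark.yarn.archive");
        return archive != null && !archive.trim().isEmpty();
    }
}
```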
{quote}
One more thing in patch number - 3
{quote}
Changed the patterns to detect spark-yarn*.jar and added a test case. The
earlier regex was too generic: spark-yarn-sources.jar was getting through the
filter and causing issues.
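One way to tighten the pattern is a negative lookbehind that rejects *-sources.jar while still matching versioned spark-yarn jars; this is a sketch, and the actual regex in the patch may differ.

```java
import java.util.regex.Pattern;

public class SparkYarnJarPattern {
    // Matches e.g. spark-yarn_2.11-2.0.0.jar, but the (?<!-sources)
    // lookbehind rejects jars ending in -sources.jar.
    static final Pattern SPARK_YARN_JAR_PATTERN =
            Pattern.compile("spark-yarn.*(?<!-sources)\\.jar$");
}
```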
> Set spark.yarn.jars to fix Spark 2.0 with Oozie
> -----------------------------------------------
>
> Key: OOZIE-2606
> URL: https://issues.apache.org/jira/browse/OOZIE-2606
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: 4.2.0
> Reporter: Jonathan Kelly
> Assignee: Satish Subhashrao Saley
> Labels: spark, spark2.0.0
> Fix For: 4.3.0
>
> Attachments: OOZIE-2606-2.patch, OOZIE-2606-3.patch, OOZIE-2606.patch
>
>
> Oozie adds all of the jars in the Oozie Spark sharelib to the
> DistributedCache such that all jars will be present in the current working
> directory of the YARN container (as well as in the container classpath).
> However, this is not quite enough to make Spark 2.0 work, since Spark 2.0 by
> default looks for the jars in assembly/target/scala-2.11/jars [1] (as if it
> is a locally built distribution for development) and will not find them in
> the current working directory.
> To fix this, we can set spark.yarn.jars to *.jar so that it finds the jars in
> the current working directory rather than looking in the wrong place. [2]
> [1]
> https://github.com/apache/spark/blob/v2.0.0-rc2/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java#L357
> [2]
> https://github.com/apache/spark/blob/v2.0.0-rc2/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L476
> Note: This property will be ignored by Spark 1.x.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)