I am trying to exclude the Hadoop jar dependencies from Spark's assembly files; the reason is that, in order to work on our cluster, it is necessary to use our own versions of those files instead of the published ones. I tried defining the Hadoop dependencies as "provided", but surprisingly this causes compilation errors in the build.
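For context, the end goal is to compile against our own copies of the Hadoop jars rather than the published artifacts. One way to do that (not what I have in the build today) would be to add them as unmanaged dependencies; a minimal sketch, where localHadoopSettings, HADOOP_HOME, and the /opt/hadoop path are only placeholders for wherever the cluster's jars actually live:

    // Sketch only: compile against a local copy of the Hadoop jars by adding
    // them as unmanaged dependencies. HADOOP_HOME and the fallback path are
    // placeholders, not part of the actual Spark build.
    def localHadoopSettings = Seq(
      unmanagedJars in Compile ++= {
        val hadoopHome = file(sys.env.getOrElse("HADOOP_HOME", "/opt/hadoop"))
        ((hadoopHome / "lib") ** "*.jar").classpath
      }
    )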
Just to be clear, I modified the sbt build file as follows:

    def yarnEnabledSettings = Seq(
      libraryDependencies ++= Seq(
        // Exclude rule required for all?
        "org.apache.hadoop" % "hadoop-client" % hadoopVersion % "provided"
          excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
        "org.apache.hadoop" % "hadoop-yarn-api" % hadoopVersion % "provided"
          excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
        "org.apache.hadoop" % "hadoop-yarn-common" % hadoopVersion % "provided"
          excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib),
        "org.apache.hadoop" % "hadoop-yarn-client" % hadoopVersion % "provided"
          excludeAll(excludeJackson, excludeNetty, excludeAsm, excludeCglib)
      )
    )

and compile with

    SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_IS_NEW_HADOOP=true sbt assembly

but the assembly still includes the Hadoop libraries, contrary to what the sbt-assembly docs say. I managed to exclude them instead by using the non-recommended way:

    def extraAssemblySettings() = Seq(
      test in assembly := {},
      mergeStrategy in assembly := {
        case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
        case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
        case "log4j.properties" => MergeStrategy.discard
        case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
        case "reference.conf" => MergeStrategy.concat
        case _ => MergeStrategy.first
      },
      // Filter anything with "hadoop" in the jar name out of the assembly.
      excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
        cp filter { _.data.getName.contains("hadoop") }
      }
    )

But I would like to hear whether there is interest in excluding the Hadoop jars by default in the build.

Alex Cozzi
alexco...@gmail.com
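P.S. In case it helps anyone reproduce this: a quick way to confirm whether the Hadoop classes are really gone is to scan the assembly jar. A small standalone Scala check along these lines should work (the jar path is just a placeholder for the actual assembly file):

    import java.util.jar.JarFile
    import scala.collection.JavaConverters._

    object CheckAssembly extends App {
      // Placeholder path: point this at the actual assembly jar.
      val jar = new JarFile("assembly/target/spark-assembly.jar")
      // Collect every entry under org/apache/hadoop that survived assembly.
      val hadoopEntries = jar.entries.asScala.map(_.getName)
        .filter(_.startsWith("org/apache/hadoop"))
        .toList
      println(s"${hadoopEntries.size} Hadoop entries left in the assembly")
      hadoopEntries.take(10).foreach(println)
    }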