dbtsai opened a new pull request #28411:
URL: https://github.com/apache/spark/pull/28411


   ### What changes were proposed in this pull request?
   We are adding a new Spark YARN configuration, 
`spark.yarn.populateHadoopClasspath`, which controls whether the Hadoop 
classpath is populated from `yarn.application.classpath` and 
`mapreduce.application.classpath`.
   
   ### Why are the changes needed?
   The Spark YARN client populates extra Hadoop classpath entries from 
`yarn.application.classpath` and `mapreduce.application.classpath` when a job 
is submitted to a YARN Hadoop cluster.
   
   However, for a `with-hadoop` Spark build that embeds the Hadoop runtime, 
this can cause jar conflicts, because the Spark distribution may contain a 
different version of the Hadoop jars than the cluster.
   
   One case we have seen is a user who runs an Apache Spark distribution with 
its own embedded Hadoop and submits a job to a Cloudera or Hortonworks YARN 
cluster; with two incompatible versions of the Hadoop jars on the classpath, 
the job runs into errors.
   
   Not populating the Hadoop classpath from the cluster addresses this issue.
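
   As a sketch of how the new flag would be used (assuming it is a boolean 
that defaults to `true`, i.e. the current behavior of populating the cluster's 
Hadoop classpath), a user of a `with-hadoop` build could disable it in 
`spark-defaults.conf`:

   ```
   # Hypothetical usage sketch of the new flag proposed in this PR.
   # Assumes a boolean value with default true (populate cluster classpath).
   # Setting it to false keeps only the Hadoop jars shipped with the
   # Spark distribution on the classpath, avoiding version conflicts
   # with the cluster's own Hadoop jars.
   spark.yarn.populateHadoopClasspath  false
   ```

   The same setting could equivalently be passed per job via 
`--conf spark.yarn.populateHadoopClasspath=false` on `spark-submit`.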
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   A unit test is added, but it is very hard to add a new integration test, 
since that would require using different, incompatible versions of Hadoop.
   
   We also manually tested this PR: we were able to submit a Spark job using a 
Spark distribution built with Apache Hadoop 2.10 to a CDH 5.6 cluster without 
populating the CDH classpath.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


