GitHub user zjffdu opened a pull request:
https://github.com/apache/spark/pull/14784
[SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running SparkR in RStudio
## What changes were proposed in this pull request?
Spark adds sparkr.zip to the archives only when the application runs in YARN mode (see SparkSubmit.scala):
```
if (args.isR && clusterManager == YARN) {
  val sparkRPackagePath = RUtils.localSparkRPackagePath
  if (sparkRPackagePath.isEmpty) {
    printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
  }
  val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
  if (!sparkRPackageFile.exists()) {
    printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
  }
  val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
  // Distribute the SparkR package.
  // Assigns a symbol link name "sparkr" to the shipped package.
  args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
  // Distribute the R package archive containing all the built R packages.
  if (!RUtils.rPackages.isEmpty) {
    val rPackageFile =
      RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
    if (!rPackageFile.exists()) {
      printErrorAndExit("Failed to zip all the built R packages.")
    }
    val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
    // Assigns a symbol link name "rpkg" to the shipped package.
    args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
  }
}
```
When SparkR is launched from RStudio, spark.master is only set on the R side, so the JVM backend never sees YARN mode and skips the branch above. It is therefore necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.
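As a minimal sketch of the idea (the helper name below is hypothetical, not the actual code in this patch): SparkR launches the JVM backend through spark-submit and honors the SPARKR_SUBMIT_ARGS environment variable, so the relevant properties can be forwarded by appending them there before the session starts.
```
# Hypothetical helper (not the actual patch): forward master, keytab and
# principal from the R process to the JVM by appending them to the
# spark-submit arguments SparkR uses when it spawns the backend.
forwardToJvm <- function(master, keytab = NULL, principal = NULL) {
  opts <- paste("--master", master)
  if (!is.null(keytab)) {
    opts <- paste(opts, "--conf", paste0("spark.yarn.keytab=", keytab))
  }
  if (!is.null(principal)) {
    opts <- paste(opts, "--conf", paste0("spark.yarn.principal=", principal))
  }
  # "sparkr-shell" stays last, following the documented SPARKR_SUBMIT_ARGS pattern.
  Sys.setenv(SPARKR_SUBMIT_ARGS = paste(opts, "sparkr-shell"))
}
```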
## How was this patch tested?
Verified manually in RStudio using the following code.
```
Sys.setenv(SPARK_HOME = "/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master = "yarn-client",
               sparkConfig = list(spark.executor.instances = "1"))
df <- as.DataFrame(mtcars)
head(df)
```
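As an extra check (not part of the patch), forcing an R closure to run on the executors only succeeds if sparkr.zip was actually shipped to them:
```
# spark.lapply runs the function on the executors, so a successful result
# confirms that the SparkR package was distributed to them.
result <- spark.lapply(1:4, function(x) x * 2)
print(result)
```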
…
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zjffdu/spark SPARK-17210
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14784.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14784