GitHub user zjffdu opened a pull request:
https://github.com/apache/spark/pull/14784
[SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running SparkR in RStudio
## What changes were proposed in this pull request?
Spark adds sparkr.zip to the archives only when the application runs in YARN mode (see SparkSubmit.scala):
```
if (args.isR && clusterManager == YARN) {
  val sparkRPackagePath = RUtils.localSparkRPackagePath
  if (sparkRPackagePath.isEmpty) {
    printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
  }
  val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
  if (!sparkRPackageFile.exists()) {
    printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
  }
  val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
  // Distribute the SparkR package.
  // Assigns a symbol link name "sparkr" to the shipped package.
  args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
  // Distribute the R package archive containing all the built R packages.
  if (!RUtils.rPackages.isEmpty) {
    val rPackageFile =
      RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
    if (!rPackageFile.exists()) {
      printErrorAndExit("Failed to zip all the built R packages.")
    }
    val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
    // Assigns a symbol link name "rpkg" to the shipped package.
    args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
  }
}
```
When SparkR is launched from RStudio, spark.master is only set on the R side, so the JVM backend never sees YARN mode and skips the branch above. It is therefore necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.
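As a minimal sketch of the idea (the helper name below is hypothetical, not the actual code in this patch): SparkR launches the JVM backend through spark-submit and honors the SPARKR_SUBMIT_ARGS environment variable, so the relevant properties can be forwarded by appending them there before the session starts.
```
# Hypothetical helper (not the actual patch): forward master, keytab and
# principal from the R process to the JVM by appending them to the
# spark-submit arguments SparkR uses when it spawns the backend.
forwardToJvm <- function(master, keytab = NULL, principal = NULL) {
  opts <- paste("--master", master)
  if (!is.null(keytab)) {
    opts <- paste(opts, "--conf", paste0("spark.yarn.keytab=", keytab))
  }
  if (!is.null(principal)) {
    opts <- paste(opts, "--conf", paste0("spark.yarn.principal=", principal))
  }
  # "sparkr-shell" stays last, following the documented SPARKR_SUBMIT_ARGS pattern.
  Sys.setenv(SPARKR_SUBMIT_ARGS = paste(opts, "sparkr-shell"))
}
```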
## How was this patch tested?
Verified manually in RStudio using the following code.
```
Sys.setenv(SPARK_HOME = "/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master = "yarn-client",
               sparkConfig = list(spark.executor.instances = "1"))
df <- as.DataFrame(mtcars)
head(df)
```
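As an extra check (not part of the patch), forcing an R closure to run on the executors only succeeds if sparkr.zip was actually shipped to them:
```
# spark.lapply runs the function on the executors, so a successful result
# confirms that the SparkR package was distributed to them.
result <- spark.lapply(1:4, function(x) x * 2)
print(result)
```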
…
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zjffdu/spark SPARK-17210
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14784.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14784