[ https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592272#comment-15592272 ]
Piotr Smolinski commented on SPARK-17904:
-----------------------------------------

Would it work at all? I have been looking recently at the SparkR implementation. At the moment, on the executor side all dapply/gapply/spark.lapply calls are single-shot operations. The executor JVM either forks a worker from a small, preallocated daemon process or launches a new R runtime (on Windows, or when the daemon is explicitly disabled), only for the duration of the call. This process is disposed of immediately once the task is done. That means there is no R runtime that can be preinitialized.

Check: https://github.com/apache/spark/blob/master/R/pkg/inst/worker/worker.R

> Add a wrapper function to install R packages on each executor.
> ---------------------------------------------------------------
>
>                 Key: SPARK-17904
>                 URL: https://issues.apache.org/jira/browse/SPARK-17904
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR
>            Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function passed into
> {{spark.lapply}} or {{dapply}}, they have to install the required R packages
> on each executor in advance.
> To install dependent R packages on each executor and check that the
> installation succeeded, we can run code similar to the following:
> (Note: The code is just an example, not the prototype of this proposal. The
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L),
>                                 function(x) { install.packages("Matrix") })
> collectRDD(rdd)  # force evaluation so the install actually runs
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet every time you need a third-party
> library. Since SparkR is an interactive analytics tool, users may load many
> libraries during an analytics session. In native R, users can simply run
> {{install.packages()}} and {{library()}} throughout the interactive session.
> Should we provide an API that wraps the work mentioned above, so that users
> can install dependent R packages on each executor easily?
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a
> local/HDFS path, and SparkR will install the packages from local package
> archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to
> install from local directories.
> Since SparkR has its own library directories into which it installs packages
> on each executor, I think this will not pollute the native R environment. I'd
> like to know whether this makes sense, and feel free to correct me if there
> is any misunderstanding.
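For illustration only, here is a minimal sketch of what such a wrapper could look like if it were built on the same internal RDD trick as the example above. This is a hypothetical sketch, not the proposed implementation: it assumes a SparkContext {{sc}} is already available in the session, uses SparkR's internal (non-exported) RDD API, and relies on one task per partition reaching every executor, which Spark does not guarantee.

{code}
# Hypothetical sketch only -- not the actual SparkR API or implementation.
# Assumes SparkR's internal RDD functions (parallelize/lapplyPartition/collectRDD)
# and a SparkContext `sc` created earlier in the session.
spark.installPackages <- function(pkgs, repos = getOption("repos"), numSlices = 2L) {
  doInstall <- function(part) {
    # Runs inside an R worker on an executor; installs into that node's library path.
    install.packages(pkgs, repos = repos)
    # Report back whether each requested package is now installed there.
    pkgs %in% rownames(installed.packages())
  }
  rdd <- SparkR:::lapplyPartition(
    SparkR:::parallelize(sc, seq_len(numSlices), numSlices),
    doInstall)
  unlist(SparkR:::collectRDD(rdd))
}

# Usage (hypothetical):
# spark.installPackages("Matrix", repos = "https://cloud.r-project.org")
{code}

As the comment above points out, the R worker process itself is short-lived, so the usefulness of such a wrapper depends on the packages being installed into a library directory that persists across tasks on each executor, not just into the worker process that happens to run the install.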