[ https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592272#comment-15592272 ]
Piotr Smolinski commented on SPARK-17904:
-----------------------------------------

Would it work at all? I have been looking recently at the SparkR implementation. At the moment, on the executor side all dapply/gapply/spark.lapply calls are single-shot operations. The executor JVM either forks a worker from a small, preallocated daemon process or launches a new R runtime (on Windows, or when the daemon is explicitly disabled), only for the duration of the call. This process is disposed of immediately once the task is done. That means there is no R runtime that can be preinitialized.

Check: https://github.com/apache/spark/blob/master/R/pkg/inst/worker/worker.R

> Add a wrapper function to install R packages on each executor.
> ---------------------------------------------------------------
>
>                 Key: SPARK-17904
>                 URL: https://issues.apache.org/jira/browse/SPARK-17904
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR
>            Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function passed into
> {{spark.lapply}} or {{dapply}}, they have to install the required R packages
> on each executor in advance.
> To install dependent R packages on each executor and check that the
> installation succeeded, we can run code similar to the following:
> (Note: The code is just an example, not the prototype of this proposal. The
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L),
>                                 function(x) { install.packages("Matrix") })
> collectRDD(rdd)  # force evaluation so the install actually runs
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet every time you need a third-party
> library. Since SparkR is an interactive analytics tool, users may load many
> libraries during an analytics session. In native R, users can simply run
> {{install.packages()}} and {{library()}} throughout the interactive session.
> Should we provide an API that wraps the work mentioned above, so that users
> can install dependent R packages on each executor easily?
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a
> local/HDFS path, and SparkR will install the packages from local package
> archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to
> install from local directories.
> Since SparkR has its own library directories into which it installs packages
> on each executor, I think this will not pollute the native R environment. I'd
> like to know whether this makes sense, and feel free to correct me if there
> is any misunderstanding.
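For illustration only, here is a minimal sketch of what such a wrapper could look like if it were built on the same internal RDD trick as the example above. This is a hypothetical sketch, not the proposed implementation: it assumes a SparkContext {{sc}} is already available in the session, uses SparkR's internal (non-exported) RDD API, and relies on one task per partition reaching every executor, which Spark does not guarantee.

{code}
# Hypothetical sketch only -- not the actual SparkR API or implementation.
# Assumes SparkR's internal RDD functions (parallelize/lapplyPartition/collectRDD)
# and a SparkContext `sc` created earlier in the session.
spark.installPackages <- function(pkgs, repos = getOption("repos"), numSlices = 2L) {
  doInstall <- function(part) {
    # Runs inside an R worker on an executor; installs into that node's library path.
    install.packages(pkgs, repos = repos)
    # Report back whether each requested package is now installed there.
    pkgs %in% rownames(installed.packages())
  }
  rdd <- SparkR:::lapplyPartition(
    SparkR:::parallelize(sc, seq_len(numSlices), numSlices),
    doInstall)
  unlist(SparkR:::collectRDD(rdd))
}

# Usage (hypothetical):
# spark.installPackages("Matrix", repos = "https://cloud.r-project.org")
{code}

As the comment above points out, the R worker process itself is short-lived, so the usefulness of such a wrapper depends on the packages being installed into a library directory that persists across tasks on each executor, not just into the worker process that happens to run the install.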