[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17904:
--------------------------------
    Description: 
SparkR provides {{spark.lapply}} to run local R functions in a distributed 
environment, and {{dapply}} to run a UDF on a SparkDataFrame.
If users use third-party libraries inside the function passed to 
{{spark.lapply}} or {{dapply}}, they must install the required R packages on 
each executor in advance.
To install dependent R packages on each executor and verify that the 
installation succeeded, we can run code similar to the following:
{code}
rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L),
                                function(x) { install.packages("Matrix") })
test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
collectRDD(rdd)
{code}
It's cumbersome to run this code snippet every time a third-party library is 
needed. Since SparkR is an interactive analytics tool, users may load many 
libraries during an analytics session. In native R, users can simply run 
{{install.packages()}} and {{library()}} throughout the interactive session.
Should we provide an API that wraps the work above, so users can install 
dependent R packages on each executor easily?
I propose the following API:
{{spark.installPackages(pkgs, repos)}}
* pkgs: the names of the packages. If repos = NULL, this can be set to a 
local/HDFS path, and SparkR will install the packages from local package 
archives.
* repos: the base URL(s) of the repositories to use. It can be NULL to install 
from local directories.
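
As a sketch of how the proposed API might be used (the function does not exist 
yet, so the calls below are hypothetical and follow the parameter semantics 
described above; the archive path is illustrative):

{code}
# Install from a CRAN mirror on every executor (proposed API, hypothetical)
spark.installPackages("Matrix", repos = "https://cloud.r-project.org")

# Install from a local/HDFS package archive; repos = NULL skips repositories
spark.installPackages("/path/to/Matrix.tar.gz", repos = NULL)

# Afterwards the package could be used inside distributed functions
result <- spark.lapply(1:2, function(x) { library(Matrix); x })
{code}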

Since SparkR installs packages into its own library directories on each 
executor, this should not pollute the native R environment.
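
For reference, the isolation could work roughly like this: an executor-local 
SparkR library directory is prepended to R's library search path, so packages 
install there instead of into the system library (a minimal sketch; the 
directory location is an assumption, not SparkR's actual layout):

{code}
# Sketch: create a per-executor library directory (hypothetical location)
sparkr_lib <- file.path(tempdir(), "sparkr-packages")
dir.create(sparkr_lib, showWarnings = FALSE)

# Prepend it to the library search path; library() will find packages here
.libPaths(c(sparkr_lib, .libPaths()))

# Installing into sparkr_lib leaves the native R library untouched
install.packages("Matrix", lib = sparkr_lib,
                 repos = "https://cloud.r-project.org")
{code}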



> Add a wrapper function to install R packages on each executors.
> ---------------------------------------------------------------
>
>                 Key: SPARK-17904
>                 URL: https://issues.apache.org/jira/browse/SPARK-17904
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR
>            Reporter: Yanbo Liang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
