[ 
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475645#comment-15475645
 ] 

Jeff Zhang commented on SPARK-17428:
------------------------------------

I just linked the JIRA for Python virtualenv support. It seems R supports this 
kind of isolation natively: install.packages lets you choose the installation 
destination folder (the lib argument), and a specific version can be installed 
by pointing it at that version's source tarball. Installations are therefore 
isolated across users. I think there are two scenarios for the SparkR 
environment: one where the cluster has internet access, and one where it does 
not.
If the cluster has internet access, then I think we can call install.packages 
directly:
{code}
install.packages("dplyr", lib="<container_dir>")
library(dplyr, lib.loc="<container_dir>")
{code}
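The online scenario above could be wrapped in a small helper, sketched here 
under some assumptions: the function name install_into_container, the default 
repository URL, and the idea of skipping packages that are already present are 
all illustrative and not part of SparkR.

```r
# Hypothetical helper (not a SparkR API): install packages into a private,
# per-container library and make that library visible to library().
install_into_container <- function(pkgs, lib_dir,
                                   repos = "https://cloud.r-project.org") {
  dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

  # Only install what is not already in the private library.
  installed <- rownames(installed.packages(lib.loc = lib_dir))
  missing <- setdiff(pkgs, installed)
  if (length(missing) > 0) {
    install.packages(missing, lib = lib_dir, repos = repos)
  }

  # Search the private library first when loading packages.
  .libPaths(c(lib_dir, .libPaths()))
  invisible(missing)
}
```

With internet access, install_into_container("dplyr", "<container_dir>") 
followed by library(dplyr) would then load the package from the container 
directory.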
If the cluster doesn't have internet access, then the driver can first download 
the package tarballs and add them through --files. The executors will then 
compile and install these packages:
{code}
install.packages(<pathtopackage>, repos = NULL, type="source", 
lib="<container_dir>")
library(dplyr, lib.loc="<container_dir>")
{code}
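For the offline scenario, an executor-side helper might look like the sketch 
below. It assumes the tarballs shipped with --files end up in a directory the 
executor can read (tarball_dir); the function name and that directory are 
hypothetical.

```r
# Hypothetical helper (not a SparkR API): install every source tarball found
# in tarball_dir into a private library, without contacting any repository.
install_local_tarballs <- function(tarball_dir, lib_dir) {
  dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

  tarballs <- list.files(tarball_dir, pattern = "\\.tar\\.gz$",
                         full.names = TRUE)
  if (length(tarballs) > 0) {
    # repos = NULL means: treat the arguments as local files.
    # Tarballs are installed in the order given, so dependencies must sort
    # (or be listed) before the packages that need them.
    install.packages(tarballs, repos = NULL, type = "source", lib = lib_dir)
  }
  invisible(tarballs)
}
```

Compiling from source on each executor also requires the build toolchain 
(compilers, headers) to be present on the worker nodes.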
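For this scenario, if the package has dependencies, they are not resolved 
automatically (install.packages ignores dependencies when repos = NULL), so 
the installation fails unless they are already installed. The user has to 
figure out the dependencies manually and add them to the Spark app as well.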
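One possible way to work out the dependencies, sketched under the assumption 
that the driver (unlike the executors) can reach a repository index: 
tools::package_dependencies can compute the recursive dependency set from a 
package database such as the one returned by available.packages(). The helper 
name dependency_closure is hypothetical.

```r
# Hypothetical helper: return pkgs plus all of their recursive hard
# dependencies that the repository index `db` actually provides.
# `db` is a character matrix like the result of available.packages().
dependency_closure <- function(pkgs, db) {
  deps <- tools::package_dependencies(
    pkgs, db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  all_pkgs <- unique(c(pkgs, unlist(deps, use.names = FALSE)))
  # Base packages (e.g. methods, utils) are not in the index; drop them.
  intersect(all_pkgs, db[, "Package"])
}
```

On a driver with internet access, db could come from available.packages(), 
and download.packages(dependency_closure("dplyr", db), destdir = "<dir>", 
type = "source") would fetch the tarballs to pass through --files. Install 
order on the executors still has to put dependencies first.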


> SparkR executors/workers support virtualenv
> -------------------------------------------
>
>                 Key: SPARK-17428
>                 URL: https://issues.apache.org/jira/browse/SPARK-17428
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR
>            Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but 
> SparkR cannot satisfy this requirement elegantly. For example, you have to 
> go through the IT/administrators of the cluster to deploy these R packages 
> on each executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we do 
> for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository 
> on each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support 
> virtualenv, like Python conda. I have investigated and found that 
> packrat (http://rstudio.github.io/packrat/) is one of the candidates to 
> support virtualenv for R. Packrat is a dependency management system for R 
> and can isolate the dependent R packages in its own private package space. 
> SparkR users can then install third-party packages in the application scope 
> (destroyed after the application exits) and don’t need to bother 
> IT/administrators to install these packages manually.
> I would like to know whether this makes sense.



