[
https://issues.apache.org/jira/browse/SPARK-17428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15474587#comment-15474587
]
Felix Cheung commented on SPARK-17428:
--------------------------------------
Agree with the above. To be clear, packrat still calls install.packages, so it
handles the package directory (the lib parameter to install.packages) and
permissions/access no differently:
https://github.com/rstudio/packrat/blob/master/R/install.R#L69
We are likely going to prefer having private packages under the application
directory in the case of YARN, so they get cleaned up along with the
application.
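To make the application-directory idea concrete, here is a minimal sketch of
how an install call could be pointed at an application-private library through
the lib parameter. It only builds the Rscript command line and does not run
it. The helper name build_install_command, the "r-libs" subdirectory, and the
default repository URL are illustrative assumptions, not anything SparkR or
packrat does today; it also assumes POSIX-style paths and plain package names.

```python
import os
import tempfile


def build_install_command(app_dir, packages, repo="https://cloud.r-project.org"):
    """Return (lib_dir, argv) for installing R packages into an app-private library.

    Hypothetical helper: the 'r-libs' layout and this function are assumptions
    made for illustration, not SparkR behaviour.
    """
    # Keep the library under the application directory so that removing the
    # application directory also removes the installed packages.
    lib_dir = os.path.join(app_dir, "r-libs")
    os.makedirs(lib_dir, exist_ok=True)

    # Build an R character vector such as c("data.table", "jsonlite").
    pkg_vector = ", ".join('"{}"'.format(p) for p in packages)

    # install.packages takes the target library via 'lib' and the source via 'repos'.
    r_expr = 'install.packages(c({}), lib="{}", repos="{}")'.format(
        pkg_vector, lib_dir, repo
    )
    return lib_dir, ["Rscript", "-e", r_expr]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as app_dir:
        lib_dir, argv = build_install_command(app_dir, ["data.table"])
        print(argv)
```

Whether the install goes through packrat or a direct call like this, the same
directory and permission questions apply, since both end up in
install.packages.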
The original point of this JIRA seems to be private packages and their
installation/deployment. I think we agree we could handle that (or SparkR on
YARN can already do it).
My point, though, is that the real benefit of such a package management system
is control over the exact package versions.
But even then, building packages from source on worker machines could be
problematic (this applies both to packrat and to direct calls to
install.packages):
https://rstudio.github.io/packrat/limitations.html
- I'm not sure we should assume that all worker machines in enterprises have a
C compiler, or that the user running Spark has permission to build source code.
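As a rough illustration of that concern, a worker could be probed for a build
toolchain before any source install is attempted. The sketch below is only a
heuristic under stated assumptions: the tool names checked and the two helper
functions are hypothetical choices for illustration, and finding a compiler on
PATH says nothing about whether the Spark user is permitted to run it.

```python
import shutil


def find_build_tools(candidates=("gcc", "cc", "clang", "make")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in candidates}


def can_build_from_source(tools):
    """Rough heuristic: require at least one C compiler plus make."""
    has_compiler = any(tools.get(name) for name in ("gcc", "cc", "clang"))
    return has_compiler and tools.get("make") is not None


if __name__ == "__main__":
    tools = find_build_tools()
    print(tools)
    print("source builds look feasible:", can_build_from_source(tools))
```

A check like this would have to run on every worker, since a heterogeneous
cluster may have a toolchain on some machines and not on others.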
I don't know where we are with PySpark on this, but I'd be very interested in
seeing how it is resolved. I think both Python and R face similar constraints
in terms of deployment/package building, versioning, heterogeneous machine
architectures, and so on.
> SparkR executors/workers support virtualenv
> -------------------------------------------
>
> Key: SPARK-17428
> URL: https://issues.apache.org/jira/browse/SPARK-17428
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Yanbo Liang
>
> Many users need to use third-party R packages in executors/workers, but
> SparkR cannot satisfy this requirement elegantly. For example, you have to
> go through the IT/administrators of the cluster to deploy these R packages
> on each executor/worker node, which is very inflexible.
> I think we should support third-party R packages for SparkR users, as we
> do for jar packages, in the following two scenarios:
> 1. Users can install R packages from CRAN or a custom CRAN-like repository
> on each executor.
> 2. Users can load their local R packages and install them on each executor.
> To achieve this goal, the first step is to make SparkR executors support a
> virtualenv-style isolated environment, like Python's conda. I have
> investigated and found that packrat (http://rstudio.github.io/packrat/) is
> one candidate for providing this in R. Packrat is a dependency management
> system for R that can isolate an application's dependent R packages in its
> own private package space. SparkR users can then install third-party
> packages at application scope (destroyed after the application exits) and
> don’t need to bother IT/administrators to install these packages manually.
> I would like to know whether this makes sense.