[
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571785#comment-15571785
]
Sun Rui commented on SPARK-17904:
---------------------------------
This is a little bit tricky. You don't know the exact number of nodes in the
underlying cluster, and you can't control precisely where the executors will be
launched, so you can't ensure that R packages are installed on every node of the
cluster. Executors may also be launched on different subsets of nodes,
particularly with dynamic allocation.
I am thinking of another way:
1. Provide spark.installPackages() as a worker-side function instead of a
driver-side function. It calls R's install.packages() but sets the lib
location to the session temporary directory, to avoid possible permission and
pollution issues.
2. If a user-provided R UDF calls functions from third-party packages, the user
can call spark.installPackages() inside the R UDF.
spark.installPackages() needs a way to handle parallel installation of packages,
because several tasks may run on the same worker node.
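A minimal sketch of what such a worker-side helper might look like, in plain R.
Everything here is an assumption for illustration: the function name
installPackagesOnWorker, the default repos URL, the default lib path, and the
lock-directory scheme are hypothetical and not part of SparkR. It relies on
directory creation being atomic on the local file system, and it does not
recover from a stale lock left behind by a killed task.

```r
# Hypothetical worker-side helper (not a SparkR API).
# Installs any missing packages into a private library directory and
# serializes concurrent installs on the same node with a lock directory.
installPackagesOnWorker <- function(pkgs,
                                    repos = "https://cloud.r-project.org",
                                    lib = file.path(tempdir(), "spark-r-libs")) {
  # Private library location, so the system R library is never modified.
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)

  # dir.create() returns FALSE when the directory already exists, so the
  # lock directory acts as a simple mutex between tasks sharing `lib`.
  lock <- file.path(lib, ".install-lock")
  while (!dir.create(lock, showWarnings = FALSE)) {
    Sys.sleep(1)
  }
  on.exit(unlink(lock, recursive = TRUE), add = TRUE)

  # Install only the packages that are not yet present in `lib`.
  missing <- setdiff(pkgs, rownames(installed.packages(lib.loc = lib)))
  if (length(missing) > 0) {
    install.packages(missing, lib = lib, repos = repos)
  }

  # Make the private library visible to library() and requireNamespace().
  .libPaths(c(lib, .libPaths()))
  invisible(missing)
}
```

An R UDF would call this helper first and then load the package as usual. For
tasks on the same node to share one installation, lib would have to point to a
directory common to those tasks rather than to the per-process tempdir() used
as the default here.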
> Add a wrapper function to install R packages on each executors.
> ---------------------------------------------------------------
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed
> environment, and {{dapply}} to run a UDF on a SparkDataFrame.
> If users use third-party libraries inside the function passed
> to {{spark.lapply}} or {{dapply}}, they must install the required R packages
> on each executor in advance.
> To install the dependent R packages on each executor and check that the
> installation succeeded, we can run code like the following:
> {code}
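> # Install Matrix from inside each partition, i.e. on the executors.
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L),
>   function(part) { install.packages("Matrix") })
> # RDD transformations are lazy; an action is needed to actually run the install.
> invisible(SparkR:::collectRDD(rdd))
> # Check on each partition whether Matrix is now installed.
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> SparkR:::collectRDD(rdd)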
> {code}
> It’s cumbersome to run this code snippet every time you need a third-party
> library. Since SparkR is an interactive analytics tool, users may load many
> libraries during an analytics session. In native R, users can run
> {{install.packages()}} and {{library()}} throughout the interactive session.
> Should we provide an API that wraps the work described above, so that users can
> easily install dependent R packages on each executor?
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a
> local/hdfs path, and SparkR will then install the packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to
> install from local directories.
> Since SparkR has its own library directories in which to install the packages
> on each executor, I think it will not pollute the native R environment.