[
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hossein Falaki updated SPARK-24359:
-----------------------------------
Description:
h1. Background and motivation
SparkR supports calling MLlib functionality with an [R-friendly
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
Since Spark 1.5 the (new) SparkML API which is based on [pipelines and
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
has matured significantly. It allows users to build and maintain complicated
machine learning pipelines. A lot of this functionality is difficult to expose
using the simple formula-based API in SparkR.
We propose a new R package, _SparkML_, to be distributed along with SparkR as
part of Apache Spark. This new package will be built on top of SparkR’s APIs to
expose SparkML’s pipeline APIs and functionality.
*Why not SparkR?*
The SparkR package contains ~300 functions, many of which shadow functions in
base R and other popular CRAN packages. We think adding more functions to
SparkR will degrade usability and make maintenance harder.
*Why not sparklyr?*
sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge
they are not comprehensive. Also, we propose a code-gen approach for this
package to minimize the work needed to expose future MLlib APIs, whereas
sparklyr's API is manually written.
h1. Target Personas
* Existing SparkR users who need a more flexible SparkML API
* R users (data scientists, statisticians) who wish to build Spark ML
pipelines in R
h1. Goals
* R users can install SparkML from CRAN
* R users will be able to import SparkML independently of SparkR
* After setting up a Spark session, R users can:
** create a pipeline by chaining individual components and specifying their
parameters
** tune a pipeline in parallel, taking advantage of Spark
** inspect a pipeline's parameters and evaluation metrics
** repeatedly apply a pipeline
* MLlib contributors can easily add R wrappers for new MLlib Estimators and
Transformers
h1. Non-Goals
* Adding new algorithms to the SparkML R package that do not exist in Scala
* Parallelizing existing CRAN packages
* Changing existing SparkR ML wrapping API
h1. Proposed API Changes
h2. Design goals
When encountering trade-offs in the API, we will choose based on the following
list of priorities. The API choice that addresses a higher-priority goal will
be chosen.
# *Comprehensive coverage of MLlib API:* Design choices that make R coverage
of future ML algorithms difficult will be ruled out.
# *Semantic clarity:* We attempt to minimize confusion with other packages.
Between conciseness and clarity, we will choose clarity.
# *Maintainability and testability:* API choices that require manual
maintenance or make testing difficult should be avoided.
# *Interoperability with the rest of Spark's components:* We will keep the R
API as thin as possible and keep all functionality implementation in JVM/Scala.
# *Being natural to R users:* The ultimate users of this package are R users,
and they should find it easy and natural to use.
The API will follow familiar R function syntax, where the object is passed as
the first argument of the method: do_something(obj, arg1, arg2). All
constructors are dot-separated (e.g., spark.logistic.regression()) and all
setters and getters are snake case (e.g., set_max_iter()). If a constructor
takes arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, as in the example above, the syntax translates
nicely to a natural pipeline style with help from the very popular [magrittr
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For
example:
{code:java}
> spark.logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace
All new API will live in a new CRAN package named SparkML. The package should
be usable without needing SparkR in the namespace. The package will introduce a
number of S4 classes that inherit from four basic classes. Here we list the
basic types with a few examples. An object of any child class can be
instantiated with a function call that starts with the spark. prefix.
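As an illustration, a minimal sketch of how the basic S4 classes could be
declared; the hierarchy follows MLlib (Transformer and Estimator extend
PipelineStage, and Pipeline is an Estimator), and the jobj slot is an
assumption of this sketch, not a finalized design:
{code:java}
# Sketch (not the actual implementation) of the basic S4 classes. The
# jobj slot would hold a handle to the corresponding JVM object.
library(methods)
setClass("PipelineStage", representation(jobj = "ANY"))
setClass("Transformer", contains = "PipelineStage")  # transforms a SparkDataFrame
setClass("Estimator", contains = "PipelineStage")    # fit() yields a Transformer
setClass("Pipeline", contains = "Estimator")         # a chain of PipelineStages
setClass("Evaluator", representation(jobj = "ANY"))  # computes a scalar metric
{code}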
h2. Pipeline & PipelineStage
A pipeline object contains one or more stages.
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
where stage1, stage2, etc. are S4 objects of type PipelineStage, and pipeline
is an object of type Pipeline.
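A Pipeline is itself an Estimator, so fitting it runs its stages in order and
yields a fitted pipeline model that can be applied to new data. A minimal
sketch, where training is a placeholder SparkDataFrame:
{code:java}
# Sketch: fitting a Pipeline produces a fitted model (a Transformer),
# which can then be applied to new data with transform().
> model <- pipeline %>% fit(training)
> predictions <- model %>% transform(df)
{code}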
h2. Transformers
A Transformer is an algorithm that can transform one SparkDataFrame into
another SparkDataFrame.
*Example API:*
{code:java}
> tokenizer <- spark.tokenizer() %>%
    set_input_col("text") %>%
    set_output_col("words")
> tokenized.df <- tokenizer %>% transform(df){code}
h2. Estimators
An Estimator is an algorithm that can be fit on a SparkDataFrame to produce a
Transformer. For example, a learning algorithm is an Estimator that trains on a
SparkDataFrame and produces a model.
*Example API:*
{code:java}
lr <- spark.logistic.regression() %>%
set_max_iter(10) %>%
set_reg_param(0.001){code}
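Since an Estimator is a PipelineStage, a configured Estimator can be chained
with Transformers inside a pipeline; a sketch reusing the tokenizer from the
previous section:
{code:java}
# Sketch: chaining the tokenizer (a Transformer) and lr (an Estimator)
# as stages of a single pipeline.
> pipeline <- spark.pipeline() %>% set_stages(tokenizer, lr)
{code}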
h2. Evaluators
An Evaluator computes metrics from predictions (model outputs) and returns a
scalar metric.
*Example API:*
{code:java}
lr.eval <- spark.regression.evaluator(){code}
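Applying the evaluator to a SparkDataFrame of predictions would then return the
scalar metric. A sketch; set_metric_name() and evaluate() are assumptions
modeled on MLlib's RegressionEvaluator, not a finalized API:
{code:java}
# Sketch: computing a scalar metric from a predictions SparkDataFrame.
# set_metric_name() and evaluate() are hypothetical wrappers mirroring
# MLlib's RegressionEvaluator.setMetricName() and evaluate().
> rmse <- lr.eval %>%
    set_metric_name("rmse") %>%
    evaluate(predictions)
{code}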
h2. Miscellaneous Classes
MLlib pipelines have a variety of miscellaneous classes that serve as helpers
and utilities. For example, an object of ParamGridBuilder is used to build a
grid-search pipeline. Another example is ClusteringSummary.
*Example API:*
{code:java}
> grid <- spark.param.grid.builder() %>%
    add_grid(reg_param(lr), c(0.1, 0.01)) %>%
    add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
    add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))
> model <- spark.train.validation.split() %>%
    set_estimator(lr) %>%
    set_evaluator(spark.regression.evaluator()) %>%
    set_estimator_param_maps(grid) %>%
    set_train_ratio(0.8) %>%
    set_parallelism(2) %>%
    fit(training){code}
h2. Pipeline Persistence
The SparkML package will fix a longstanding issue with SparkR model persistence
(SPARK-15572). SparkML will directly wrap MLlib's pipeline persistence API.
*API example:*
{code:java}
> model <- pipeline %>% fit(training)
> model %>% spark.write.pipeline(overwrite = TRUE, path = "..."){code}
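A reader would presumably mirror the writer. A sketch; spark.read.pipeline() is
a hypothetical name chosen by symmetry with spark.write.pipeline():
{code:java}
# Sketch: loading a persisted pipeline model back into an R session.
# spark.read.pipeline() is a hypothetical name, by symmetry with the
# spark.write.pipeline() call above.
> model <- spark.read.pipeline(path = "...")
{code}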
h1. Design Sketch
We propose using code generation from Scala to produce comprehensive API
wrappers in R. For more details, please see the attached design document.
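To make the code-gen idea concrete, here is a hedged sketch of what one
generated setter wrapper might look like on the R side, assuming SparkR's
sparkR.callJMethod utility and a jobj slot holding the wrapped JVM object (both
assumptions of this sketch; the actual generated shape is specified in the
design document):
{code:java}
# Sketch of a code-generated setter. The generator would emit one such
# function per MLlib Param, so new Estimators and Transformers pick up
# R wrappers without manual work. sparkR.callJMethod invokes a method on
# the wrapped JVM object; the jobj slot is an assumption of this sketch.
set_max_iter <- function(obj, value) {
  SparkR::sparkR.callJMethod(obj@jobj, "setMaxIter", as.integer(value))
  obj  # return the wrapper so calls chain naturally with %>%
}
{code}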
> SPIP: ML Pipelines in R
> -----------------------
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Affects Versions: 3.0.0
> Reporter: Hossein Falaki
> Priority: Major
> Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf