[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153173#comment-14153173 ]
Christoph Sawade edited comment on SPARK-3702 at 9/30/14 1:59 PM:
------------------------------------------------------------------

Great initiative. I really appreciate the attempt to standardize and identify common interfaces. Currently, I have three issues:

* Abstraction of Multilabel

The distinction between classification and regression seems natural, and the abstraction of multi-label also makes sense to me. The simplest multi-label approach I can think of is a collection of binary classifiers. Do you also plan to support mixtures of multi-label tasks (regression / multinomial classification)? If so, does it make sense to distinguish between ``MultilabelClassifier`` and ``MultilabelRegressor``? Isn't it then just a list of Estimators?

* Model-based vs. memory-based

I am wondering whether it is worth distinguishing between memory-based predictions (e.g., k-nearest neighbour, kernel machines, ...) and model-based predictions (decision trees, NN, Naive Bayes, GLMs). Or, more generally, how does k-nearest neighbour fit into this framework?

* Model vs. Estimator Abstraction

Currently, the main distinction is between classification and regression. However, many methods are similar because they share the same modelling approach rather than the same prediction type. I am wondering how these functional similarities can be reflected in the hierarchy. I tried to follow a bottom-up approach and applied these abstractions to different learning methods. Here are two examples:

Decision trees are trained with some recursive algorithm such as ID3 or C4.5, and the prediction is obtained by traversing the tree. The difference between classification and regression plays a rather minor role. So, intuitively, there is a DecisionTree estimator that can be, e.g., ID3 or C4.5. The DecisionTreeClassifier is then a DecisionTree with classification criteria; it returns a DecisionTree.Model (the tree) with a predictClass function (Classifier.Model?).
The DecisionTreeRegressor is a DecisionTree with regression criteria; it returns a DecisionTree.Model with a predictScore function (Regressor.Model?). Formally, it looks like:

- DecisionTree extends Estimator
- DecisionTreeClassifier extends DecisionTree with Classifier
- DecisionTreeRegressor extends DecisionTree with Regressor
- DecisionTree.Model extends Transformer
- DecisionTreeClassifier.Model extends DecisionTree.Model with Classifier.Model
- DecisionTreeRegressor.Model extends DecisionTree.Model with Regressor.Model

Methods like LogReg, SVM, RidgeRegression, ... maintain a weight vector (one could probably summarize them as GLMs). The inner product with the example vector naturally yields a regression score for each prediction; a binary classification is then derived by thresholding that score. The underlying optimization problem for all of them consists of a sum over loss functions plus a regularization term (regularized empirical risk minimization), which can be solved by different solvers, e.g., SGD, LBFGS, ...
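For concreteness, the decision-tree part of this hierarchy could be sketched in Scala roughly as follows. The names follow the proposal above, not any existing MLlib API, and a one-node stump stands in for real tree training; the Estimator side is elided. The point it illustrates is that Classifier.Model works naturally as a trait mixed into a task-specific model:

```scala
// Hypothetical sketch -- names from the proposal, not actual MLlib classes.
trait Transformer {
  def transform(features: Array[Double]): Double
}

// Prediction-type traits shared across learning methods
trait ClassifierModel extends Transformer {
  def predictClass(features: Array[Double]): Int
  override def transform(features: Array[Double]): Double =
    predictClass(features).toDouble
}

trait RegressorModel extends Transformer {
  def predictScore(features: Array[Double]): Double
  override def transform(features: Array[Double]): Double =
    predictScore(features)
}

// Tree structure and traversal are shared regardless of task type;
// a one-node stump on feature 0 stands in for a real tree here.
abstract class DecisionTreeModel(threshold: Double) extends Transformer {
  protected def traverse(features: Array[Double]): Double =
    if (features(0) <= threshold) 0.0 else 1.0
}

class DecisionTreeClassifierModel(threshold: Double)
    extends DecisionTreeModel(threshold) with ClassifierModel {
  def predictClass(features: Array[Double]): Int = traverse(features).toInt
}

class DecisionTreeRegressorModel(threshold: Double)
    extends DecisionTreeModel(threshold) with RegressorModel {
  def predictScore(features: Array[Double]): Double = traverse(features)
}
```

With this shape, the classification/regression distinction is a mixin on top of the shared DecisionTreeModel, rather than the root of the hierarchy, e.g. `new DecisionTreeClassifierModel(0.5).predictClass(Array(0.3))` traverses the same stump as the regressor.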
So, to exploit this structure, I would expect something like this:

- RegularizedEmpiricalRiskMinimizer extends Estimator // LogisticRegression and SupportVectorMachine could be an automatic selection between the binomial and multinomial version
- BinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
- MultinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
- BinomialSupportVectorMachine extends RegularizedEmpiricalRiskMinimizer
- RidgeRegression extends RegularizedEmpiricalRiskMinimizer
- LinearModel extends Transformer
- BinomialLinearModel extends LinearModel with Classifier.Model
- MultinomialLinearModel extends LinearModel with Classifier.Model
- BinomialLogisticRegression.Model extends BinomialLinearModel with ProbabilisticClassificationModel
- MultinomialLogisticRegression.Model extends MultinomialLinearModel with ProbabilisticClassificationModel
- BinomialSupportVectorMachine.Model extends BinomialLinearModel // actually it is a binomial linear model
- RidgeRegression.Model extends LinearModel // actually it is a linear model

So isn't the Classifier.Model more a trait than an abstract class? Perhaps I just missed something, but I think it is helpful to consider the interfaces for specific instances. I am really interested in discussing the pros/cons.

> Standardize MLlib classes for learners, models
> ----------------------------------------------
>
>                 Key: SPARK-3702
>                 URL: https://issues.apache.org/jira/browse/SPARK-3702
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models
> those algorithms produce.
> Goals:
> * give intuitive structure to API, both for developers and for generated documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org