[
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153173#comment-14153173
]
Christoph Sawade commented on SPARK-3702:
-----------------------------------------
Great initiative. I really appreciate the attempt to standardize and identify
common interfaces. At the moment, I have three concerns:
** Abstraction of Multilabel **
The distinction between classification and regression seems natural, and the
abstraction of multi-label also makes sense to me. The simplest multi-label
approach I can think of is a collection of binary classifiers. Do you also
plan to support mixtures of multi-labels (regression / multinomial
classification)? If so, does it make sense to distinguish between
``MultilabelClassifier`` and ``MultilabelRegressor``? Isn't it then just a
list of Estimators?
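To make that last question concrete, here is a toy Scala sketch (all names hypothetical, not from the design doc) in which the multi-label estimator is literally a list of binary estimators, one fitted per label column:

```scala
// Hypothetical minimal Estimator interface for this sketch.
trait Estimator[M] {
  def fit(labels: Seq[Double], features: Seq[Array[Double]]): M
}

// Toy binary "classifier" predicting the majority label, just to make
// the wrapper runnable.
case class MajorityModel(label: Double) {
  def predict(x: Array[Double]): Double = label
}

object MajorityEstimator extends Estimator[MajorityModel] {
  def fit(labels: Seq[Double], features: Seq[Array[Double]]): MajorityModel =
    MajorityModel(if (labels.count(_ == 1.0) * 2 >= labels.size) 1.0 else 0.0)
}

// The multi-label estimator is just a list of binary estimators: fitting it
// fits one binary model per label column.
class MultilabelEstimator(base: Estimator[MajorityModel]) {
  def fit(labelMatrix: Seq[Array[Double]],
          features: Seq[Array[Double]]): Seq[MajorityModel] = {
    val numLabels = labelMatrix.head.length
    (0 until numLabels).map(j => base.fit(labelMatrix.map(_(j)), features))
  }
}
```

A mixture of label types (regression / multinomial) would then just mean a heterogeneous list of base estimators.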
** Model-based vs. memory-based **
I am wondering whether it is worth distinguishing between memory-based
methods (e.g., k-nearest neighbours, kernel machines, ...) and model-based
predictions (decision trees, neural networks, Naive Bayes, GLMs). Or, more
generally, how does k-nearest neighbours fit into this framework?
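To make the question concrete, here is a minimal Scala sketch (hypothetical names) of how k-nearest neighbours could still fit an Estimator/Model split: "fitting" merely captures the training set, so the model's state is the data itself.

```scala
// Memory-based method under the Estimator/Model split: fit() just stores
// the training data inside the returned model.
class KNNClassifier(k: Int) {
  def fit(xs: Seq[Array[Double]], ys: Seq[Double]): KNNModel =
    new KNNModel(k, xs, ys)
}

class KNNModel(k: Int, xs: Seq[Array[Double]], ys: Seq[Double]) {
  def predict(x: Array[Double]): Double = {
    // Euclidean distance to a stored example.
    val dist = (a: Array[Double]) =>
      math.sqrt((a zip x).map { case (u, v) => (u - v) * (u - v) }.sum)
    // Majority vote among the k nearest stored examples.
    val nearest = (xs zip ys).sortBy { case (a, _) => dist(a) }.take(k)
    nearest.map(_._2).groupBy(identity).maxBy(_._2.size)._1
  }
}
```

The open question is whether carrying the full training set inside a "Model" is acceptable for the proposed Transformer abstraction.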
** Model vs. Estimator Abstraction **
Currently, the main distinction is between classification and regression.
However, many methods are similar because they share the same model
structure, not because they share the same prediction type. I am wondering
how these functional similarities can be reflected in the hierarchy. I tried
to follow a bottom-up approach and applied these abstractions to different
learning methods. Here are two examples:
Decision trees are trained with some recursive algorithm such as ID3 or C4.5,
and the prediction is obtained by traversing the tree. The difference between
classification and regression plays a rather minor role. So, intuitively,
there is a DecisionTree estimator that can be, e.g., ID3 or C4.5. Then, a
DecisionTreeClassifier is a DecisionTree with classification criteria; it
returns a DecisionTree.Model (the tree) with a predictClass function
(Classifier.Model?). A DecisionTreeRegressor is a DecisionTree with
regression criteria; it returns a DecisionTree.Model with a predictScore
function (Regressor.Model?). Formally, it looks like:
DecisionTree extends Estimator
DecisionTreeClassifier extends DecisionTree with Classifier
DecisionTreeRegressor extends DecisionTree with Regressor
DecisionTree.Model extends Transformer
DecisionTreeClassifier.Model extends DecisionTree.Model with Classifier.Model
DecisionTreeRegressor.Model extends DecisionTree.Model with Regressor.Model
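As a sanity check on this hierarchy, here is a runnable Scala sketch (hypothetical names; a depth-1 stump stands in for a full ID3/C4.5 recursion): the tree building and traversal are shared, and the classifier/regressor differ only in how a leaf value is aggregated.

```scala
// Shared tree structure and traversal.
sealed trait Tree
case class Leaf(value: Double) extends Tree
case class Node(featureIdx: Int, threshold: Double,
                left: Tree, right: Tree) extends Tree

class DecisionTreeModel(root: Tree) {
  def predict(x: Array[Double]): Double = {
    def go(t: Tree): Double = t match {
      case Leaf(v)            => v
      case Node(i, thr, l, r) => if (x(i) <= thr) go(l) else go(r)
    }
    go(root)
  }
}

abstract class DecisionTree {
  // The only classification/regression-specific piece: leaf aggregation.
  protected def leafValue(labels: Seq[Double]): Double

  // Depth-1 stump on feature 0 for illustration; a real recursive
  // splitting algorithm would go here.
  def fit(xs: Seq[Array[Double]], ys: Seq[Double]): DecisionTreeModel = {
    val thr = xs.map(_(0)).sorted.apply(xs.size / 2)
    val (lo, hi) = (xs zip ys).partition(_._1(0) <= thr)
    new DecisionTreeModel(
      Node(0, thr, Leaf(leafValue(lo.map(_._2))), Leaf(leafValue(hi.map(_._2)))))
  }
}

class DecisionTreeClassifier extends DecisionTree {
  // Majority vote at the leaf.
  protected def leafValue(labels: Seq[Double]): Double =
    labels.groupBy(identity).maxBy(_._2.size)._1
}

class DecisionTreeRegressor extends DecisionTree {
  // Mean label at the leaf.
  protected def leafValue(labels: Seq[Double]): Double =
    labels.sum / labels.size
}
```

The point of the sketch is only that the subclass boundary falls at the criteria, exactly as the hierarchy above suggests.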
Methods like LogReg, SVM, RidgeRegression, ... maintain a weight vector (one
could probably summarize them as GLMs). The inner product with the feature
vector naturally yields a regression score for each prediction; a binary
classification is then derived by thresholding that score. The underlying
optimization problem for all of them consists of a sum over loss functions
plus a regularization term (regularized empirical risk minimization), which
can be solved by different solvers, e.g., SGD, L-BFGS, ... To exploit this
structure, I would expect something like this:
RegularizedEmpiricalRiskMinimizer extends Estimator
// LogisticRegression and SupportVectorMachine could be an automatic selection between the binomial and multinomial version
BinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
MultinomialLogisticRegression extends RegularizedEmpiricalRiskMinimizer
BinomialSupportVectorMachine extends RegularizedEmpiricalRiskMinimizer
RidgeRegression extends RegularizedEmpiricalRiskMinimizer
LinearModel extends Transformer
BinomialLinearModel extends LinearModel with Classifier.Model
MultinomialLinearModel extends LinearModel with Classifier.Model
BinomialLogisticRegression.Model extends BinomialLinearModel with ProbabilisticClassificationModel
MultinomialLogisticRegression.Model extends MultinomialLinearModel with ProbabilisticClassificationModel
BinomialSupportVectorMachine.Model extends BinomialLinearModel // actually it is a binomial linear model
RidgeRegression.Model extends LinearModel // actually it is a linear model
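To test this decomposition, here is a small runnable Scala sketch (hypothetical names; plain gradient descent stands in for a pluggable solver): one regularized empirical-risk minimizer parameterized only by a loss gradient, producing a LinearModel whose binary classification is just the thresholded score.

```scala
// Shared linear model: score is the inner product with the weights.
class LinearModel(val weights: Array[Double]) {
  def score(x: Array[Double]): Double =
    (weights zip x).map { case (w, v) => w * v }.sum
}

// Binary classification = thresholding the regression score at zero.
class BinomialLinearModel(weights: Array[Double]) extends LinearModel(weights) {
  def predictClass(x: Array[Double]): Double = if (score(x) > 0.0) 1.0 else 0.0
}

// One shared optimizer for LogReg/SVM/ridge: only the loss gradient differs.
// lossGrad(score, label in {-1,+1}) = d loss / d score.
class RegularizedEmpiricalRiskMinimizer(
    lossGrad: (Double, Double) => Double,
    reg: Double, stepSize: Double, iters: Int) {
  def fit(xs: Seq[Array[Double]], ys: Seq[Double]): BinomialLinearModel = {
    val n = xs.head.length
    val w = Array.fill(n)(0.0)
    for (_ <- 0 until iters; (x, y) <- xs zip ys) {
      val s = (w zip x).map { case (a, b) => a * b }.sum
      val g = lossGrad(s, y)
      // Gradient step on loss plus L2 regularization.
      for (j <- 0 until n) w(j) -= stepSize * (g * x(j) + reg * w(j))
    }
    new BinomialLinearModel(w)
  }
}

// Logistic loss gradient: d/ds log(1 + exp(-y*s)) = -y / (1 + exp(y*s)).
// Swapping in a hinge or squared-loss gradient gives SVM or ridge instead.
val logisticGrad = (s: Double, y: Double) => -y / (1.0 + math.exp(y * s))
```

This is why Classifier.Model feels more like a trait mixed into LinearModel than an abstract base class of its own.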
So isn't Classifier.Model more a trait than an abstract class? Perhaps I just
missed something, but I think it is helpful to consider the interfaces for
specific instances. I am really interested in discussing the pros and cons.
> Standardize MLlib classes for learners, models
> ----------------------------------------------
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models
> those algorithms produce.
> Goals:
> * give intuitive structure to API, both for developers and for generated
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy |
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)